Top 10 Software Engineer Research Topics for 2024

Software engineering is a dynamic and rapidly changing field that demands a thorough understanding of programming, computer science, and mathematics. As software systems become more complex, software developers must keep up with industry innovations and the latest trends. Working on software engineering research topics is an important part of staying relevant in the field.

Software engineers can do research to learn about new technologies, approaches, and strategies for developing and maintaining complex software systems, and they can choose from a wide range of topics. Software engineering research is also vital for improving the functionality, security, and dependability of software systems. Such research advances the field's state of the art and helps ensure that software engineers can continue to build high-quality, effective software systems.

What are Software Engineer Research Topics?

Software engineer research topics are areas of exploration and study in the rapidly evolving field of software engineering. They include software development approaches, software quality, software testing, software maintenance, software security, machine learning models in software engineering, DevOps, and software architecture. Each of these topics presents distinct problems and opportunities for software engineers to investigate and contribute to the field. In short, they give software engineers a way to investigate new technologies, approaches, and strategies for developing and managing complex software systems.

For example, research on agile software development could identify the benefits and drawbacks of agile methodology and develop new techniques for implementing agile practices effectively. Software testing research may explore new testing procedures and tools and assess the efficacy of existing ones. Software quality research may investigate the factors that influence software quality and develop approaches for improving it and minimizing faults and errors.

Software metrics are quantitative measures used to assess the quality, maintainability, and performance of software. Research in this area could identify novel measures for evaluating software systems or techniques for using metrics to improve software quality. Continuous integration and deployment (CI/CD) is the practice of integrating code changes into a common repository and pushing them to production in small, frequent batches. Research here could investigate best practices for establishing CI/CD or develop tools and approaches for automating the entire CI/CD process.
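To make the idea of a software metric concrete, here is a minimal Python sketch that approximates cyclomatic complexity, a commonly used maintainability metric, using the standard `ast` module. The set of node types counted as decision points and the `SAMPLE` function are assumptions made for illustration; production tools count more constructs.

```python
import ast

# Node types treated as decision points (an assumption of this sketch;
# real metric tools handle more constructs).
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.IfExp)


def cyclomatic_complexity(source):
    """Return an approximate cyclomatic complexity for each function in `source`."""
    tree = ast.parse(source)
    result = {}
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            complexity = 1  # a function with no branches has one path
            for child in ast.walk(node):
                if isinstance(child, _BRANCH_NODES):
                    complexity += 1
                elif isinstance(child, ast.BoolOp):
                    # each extra operand of and/or adds a short-circuit path
                    complexity += len(child.values) - 1
            result[node.name] = complexity
    return result


# Hypothetical code under measurement.
SAMPLE = '''
def classify(n):
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0 and d > 1:
            return "composite"
    return "other"
'''

if __name__ == "__main__":
    print(cyclomatic_complexity(SAMPLE))
```

A score like this could be tracked over time in a CI pipeline to flag functions that grow too complex, which is one way metrics research connects to CI/CD practice.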

Top Software Engineer Research Topics

In this article we will be going through the following Software Engineer Research Topics:

1. Artificial Intelligence and Software Engineering

Intersections between AI and SE

The creation of AI-powered software engineering tools is one potential research area at the intersection of artificial intelligence (AI) and software engineering. These technologies use AI techniques that include machine learning, natural language processing, and computer vision to help software engineers with a variety of tasks throughout the software development lifecycle. An AI-powered code review tool, for example, may automatically discover potential flaws or security vulnerabilities in code, saving developers a lot of time and lowering the chance of human error. Similarly, an AI-powered testing tool might build test cases and analyze test results automatically to discover areas for improvement. 

Furthermore, AI-powered project management tools may aid in the planning and scheduling of projects, resource allocation, and risk management in the project. AI can also be utilized in software maintenance duties such as automatically discovering and correcting defects or providing code refactoring solutions. However, the development of such tools presents significant technical and ethical challenges, such as the necessity of large amounts of high-quality data, the risk of bias present in AI algorithms, and the possibility of AI replacing human jobs. Continuous study in this area is therefore required to ensure that AI-powered software engineering tools are successful, fair, and responsible.

Knowledge-based Software Engineering

Another study area that overlaps with AI and software engineering is knowledge-based software engineering (KBSE). KBSE entails creating software systems capable of reasoning about knowledge and applying that knowledge to enhance software development processes. The development of knowledge-based systems that can help software engineers in detecting and addressing complicated problems is one example of KBSE in action. To capture domain-specific knowledge, these systems use knowledge representation techniques such as ontologies, and reasoning algorithms such as logic programming or rule-based systems to derive new knowledge from already existing data. 

KBSE can be used in the context of AI and software engineering to create intelligent systems that learn from past experience and apply that knowledge to improve future software development processes. A KBSE system, for example, may generate code based on previous code samples or recommend code snippets based on a project's requirements. KBSE systems could also improve the precision and efficiency of software testing and debugging by using knowledge-based techniques to identify and prioritize bugs. Continued research in this area is therefore critical to ensuring that AI-powered software engineering tools are productive, fair, and responsible.
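The rule-based reasoning mentioned above can be sketched with a small forward-chaining engine in Python. The fact names and rules below are hypothetical and exist only to show how new knowledge (a review priority) is derived from existing facts about a code change.

```python
def forward_chain(facts, rules):
    """Apply rules repeatedly until no new fact can be derived.

    facts: an iterable of strings.
    rules: a list of (premises, conclusion) pairs, where premises is a set.
    """
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            # Fire the rule when all premises hold and it adds something new.
            if premises <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


# Hypothetical domain knowledge about code changes.
RULES = [
    ({"touches_auth_module", "no_tests_changed"}, "needs_security_review"),
    ({"needs_security_review"}, "high_priority"),
    ({"only_docs_changed"}, "low_priority"),
]

derived = forward_chain({"touches_auth_module", "no_tests_changed"}, RULES)

if __name__ == "__main__":
    print(sorted(derived))
```

The second rule fires only because the first one added `needs_security_review`, which is the chaining behaviour that lets a knowledge base derive conclusions no single rule states directly.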

2. Natural Language Processing

Multimodality

Multimodality in Natural Language Processing (NLP) is one of the appealing research ideas for software engineering at the nexus of computer vision, speech recognition, and NLP. The ability of machines to comprehend and generate language from many modalities, such as text, speech, pictures, and video, is referred to as multimodal NLP. The goal of multimodal NLP is to develop systems that can learn from and interpret human communication across several modalities, allowing them to engage with humans in more organic and intuitive ways. 

The building of conversational agents or chatbots that can understand and create responses using several modalities is one example of multimodal NLP in action. These agents can analyze text input, voice input, and visual clues to provide more precise and relevant responses, allowing users to have a more natural and seamless conversational experience. Furthermore, multimodal NLP can be used to enhance language translation systems, allowing them to more accurately and effectively translate text, speech, and visual content.

The development of multimodal NLP systems must take efficiency into account. Because these systems require significant computing power to process and integrate information from multiple modalities, optimizing their efficiency is critical to ensuring that they can operate in real time and give users accurate and timely responses. One way to improve efficiency is to develop algorithms that evaluate and integrate input from several modalities efficiently.

Overall, efficiency is a critical factor in the design of multimodal NLP systems. By developing efficient algorithms, pre-processing approaches, and hardware architectures, researchers can increase the speed, precision, and scalability of these systems so that they run well and respond to users in real time.

3. Applications of Data Mining in Software Engineering

Mining Software Engineering Data

The mining of software engineering data is one of the significant research paper topics for software engineering, involving the application of data mining techniques to extract insights from enormous datasets that are generated during software development processes. The purpose of mining software engineering data is to uncover patterns, trends, and various relationships that can inform software development practices, increase software product quality, and improve software development process efficiency. 

Despite its potential benefits, mining software engineering data faces several obstacles, including data quality, scalability, and data privacy. Addressing these challenges requires continued research into more effective data mining techniques and tools, as well as methods for ensuring data privacy and security. With these issues addressed, mining software engineering data can continue to improve software development practices and overall product quality.

Clustering and Text Mining

Clustering is a data mining approach that is used to group comparable items or data points based on their features or characteristics. Clustering can be used to detect patterns and correlations between different components of software, such as classes, methods, and modules, in the context of software engineering data. 
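As a rough illustration of clustering software components, the sketch below runs a plain k-means over hypothetical modules described by two features, lines of code and number of methods. The module names, the feature values, and the naive choice of initial centroids are all assumptions; real studies would normalize features and use a library implementation.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means on a list of numeric tuples; returns one cluster label per point."""
    centroids = [points[i] for i in range(k)]  # naive init: the first k points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


# Hypothetical modules: (lines of code, number of methods).
MODULES = {
    "Parser": (1200, 40),
    "Lexer": (1100, 35),
    "Utils": (90, 5),
    "Config": (120, 6),
    "Logger": (80, 4),
}

labels = kmeans(list(MODULES.values()), k=2)
clusters = dict(zip(MODULES, labels))

if __name__ == "__main__":
    print(clusters)
```

With this data the two large modules end up in one cluster and the three small ones in the other, which is the kind of grouping a researcher might then relate to defect counts or maintenance effort.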

Text mining, on the other hand, is a data mining method used to extract valuable information from unstructured text such as software manuals, code comments, and bug reports. In the context of software engineering data, text mining can be applied to find patterns and trends in software development processes.
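A very small text mining step can be sketched as follows: count in how many bug reports each term appears, ignoring common words. The bug report texts and the stopword list are invented for illustration, and the approach is deliberately simpler than what real studies use (for example TF-IDF or topic models).

```python
import re
from collections import Counter

# A tiny, hypothetical stopword list; real pipelines use much larger ones.
STOPWORDS = {"the", "a", "an", "on", "in", "when", "is", "to", "and", "of", "after", "with"}


def top_terms(documents, n=3):
    """Return the n terms that occur in the most documents, with their counts."""
    counts = Counter()
    for doc in documents:
        tokens = re.findall(r"[a-z]+", doc.lower())
        # Count each term once per document (document frequency).
        counts.update(set(tokens) - STOPWORDS)
    return counts.most_common(n)


# Hypothetical bug report titles.
BUG_REPORTS = [
    "Crash on login when password is empty",
    "Login page crash after timeout",
    "Timeout when uploading large file",
    "Crash in file upload dialog",
]

if __name__ == "__main__":
    print(top_terms(BUG_REPORTS))
```

Even this crude count surfaces recurring themes across reports, which is the kind of pattern text mining aims to expose.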

4. Data Modeling

Data modeling is an important research topic in software engineering, especially in the context of database design and management. It involves developing a conceptual model of the data that a system needs to store, organize, and manage, and establishing the relationships between data elements. One important goal of data modeling research is to make sure that the database schema precisely matches the requirements of the system and its users. This requires working closely with stakeholders to understand their needs and identify the data items that matter most to them.
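As a sketch of how a conceptual model becomes a database schema, the example below uses Python's built-in `sqlite3` module to define two related entities and lets the database enforce the relationship between them. The `customer` and `purchase` tables and their columns are hypothetical and stand in for whatever entities stakeholders identify.

```python
import sqlite3

# Hypothetical schema: each purchase must belong to an existing customer.
SCHEMA = """
CREATE TABLE customer (
    id    INTEGER PRIMARY KEY,
    email TEXT NOT NULL UNIQUE
);
CREATE TABLE purchase (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    total_cents INTEGER NOT NULL CHECK (total_cents >= 0)
);
"""


def build_database():
    """Create an in-memory database with the schema above."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default
    conn.executescript(SCHEMA)
    return conn


def insert_orphan_purchase(conn):
    """Try to insert a purchase for a customer that does not exist."""
    try:
        conn.execute("INSERT INTO purchase (customer_id, total_cents) VALUES (99, 100)")
    except sqlite3.IntegrityError:
        return False  # the schema rejected the inconsistent row
    return True


conn = build_database()
conn.execute("INSERT INTO customer (id, email) VALUES (1, 'a@example.com')")
conn.execute("INSERT INTO purchase (customer_id, total_cents) VALUES (1, 2599)")

if __name__ == "__main__":
    print(insert_orphan_purchase(conn))
```

Encoding relationships and constraints in the schema means that data which contradicts the model is rejected at the database level rather than discovered later in application code.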

5. Verification and Validation

Verification and validation are significant research topics in software engineering because they help ensure that software systems are built correctly and meet the needs of their users. Although the terms are often used interchangeably, they refer to distinct activities in the software development process. Verification is the process of ensuring that a software system conforms to its specifications and requirements. This involves testing the system to confirm that it behaves as planned and satisfies its functional and performance specifications. Validation, in contrast, is the process of ensuring that a software system fulfils the needs of its users and stakeholders. This includes confirming that the system serves its intended function in practice.

Verification and validation are key components of the software development process. By verifying that software systems are built correctly and validating that they satisfy their users' needs, researchers can help improve the functionality and dependability of software systems, reduce the chance of faults and mistakes, and ultimately deliver better software products.

6. Software Project Management

Software project management is an important component of software engineering research because it covers the planning, organization, and control of resources and activities needed to finish software projects on time, within budget, and to the required quality standards. One of its key purposes is to ensure that the needs of the project's stakeholders, such as users, clients, and sponsors, are met. This includes defining the project's requirements, scope, and goals, as well as identifying risks and constraints that could affect the project's success.

7. Software Quality

The quality of a software product is defined by how well it conforms to its requirements, how well it performs its intended functions, and how well it meets the needs of its users. It includes attributes such as dependability, usability, maintainability, effectiveness, and security. Software quality is a prominent and essential research topic in software engineering. Researchers are working on methodologies, strategies, and tools for evaluating and improving software quality, as well as for forecasting and preventing software faults and defects. Overall, software quality research is a large and interdisciplinary field that combines computer science, engineering, and statistics. Its mission is to increase the reliability, accessibility, and overall quality of software products and systems, benefiting both software developers and end users.

8. Ontology

An ontology is a formal specification of a conceptualization of a domain, used in computer science to enable knowledge sharing and reuse. Ontology is a popular and essential area of study in software engineering research. One research topic is the construction of ontologies for specific domains or application areas. For example, a researcher may create an ontology for e-commerce to give software developers and stakeholders in that domain a shared vocabulary and body of knowledge. The integration of several ontologies is another intriguing research topic. As the number of ontologies created for various domains and applications grows, there is an increasing need to integrate them to enable interoperability and reuse.
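The e-commerce example above can be sketched as a tiny concept hierarchy in Python. The concept names and the single `is_a` relation are assumptions made for illustration; real ontologies are usually written in dedicated languages and carry many more relation types.

```python
# Hypothetical e-commerce concepts: child concept -> broader concept.
IS_A = {
    "Laptop": "Electronics",
    "Phone": "Electronics",
    "Electronics": "Product",
    "Book": "Product",
}


def ancestors(concept, is_a=IS_A):
    """Return all broader concepts of `concept`, nearest first."""
    chain = []
    seen = {concept}
    while concept in is_a:
        concept = is_a[concept]
        if concept in seen:  # guard against accidental cycles
            break
        seen.add(concept)
        chain.append(concept)
    return chain


def is_kind_of(concept, candidate, is_a=IS_A):
    """True if `concept` is `candidate` or a narrower concept of it."""
    return candidate == concept or candidate in ancestors(concept, is_a)


if __name__ == "__main__":
    print(ancestors("Laptop"))
    print(is_kind_of("Book", "Electronics"))
```

Because the hierarchy is explicit, two teams that agree on it can exchange data about a `Laptop` and both infer that it is also a `Product`, which is the shared-terminology benefit the paragraph describes.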

9. Software Models

In general, a software model acts as an abstract representation of a software system or its components. Software models can be used to help software developers, different stakeholders, and users communicate more effectively, as well as to properly evaluate, design, test, and maintain software systems. The development and evaluation of modeling languages and notations is one research example connected to software models. Researchers, for example, may evaluate the usefulness and efficiency of various modeling languages, such as UML or BPMN, for various software development activities or domains. 

Researchers could also look into using software models for software testing and verification. They may investigate how models can be used to generate test cases or to perform model checking, a formal technique for verifying the correctness of software systems. They may also examine the use of models for runtime monitoring and for adapting software systems.
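Generating test cases from a model can be sketched with a small state machine and a breadth-first search. The order workflow below is hypothetical, and the search is only a simplified stand-in for what model-based testing and model checking tools do: it finds, for every state, a shortest sequence of events that reaches it, and each such sequence can serve as a test case.

```python
from collections import deque

# Hypothetical order workflow: state -> {event: next state}.
ORDER_MODEL = {
    "created": {"pay": "paid", "cancel": "cancelled"},
    "paid": {"ship": "shipped", "refund": "cancelled"},
    "shipped": {"deliver": "delivered"},
    "delivered": {},
    "cancelled": {},
}


def shortest_paths(model, start):
    """Breadth-first search; maps each reachable state to a shortest event sequence."""
    paths = {start: []}
    queue = deque([start])
    while queue:
        state = queue.popleft()
        for event, nxt in model[state].items():
            if nxt not in paths:
                paths[nxt] = paths[state] + [event]
                queue.append(nxt)
    return paths


paths = shortest_paths(ORDER_MODEL, "created")
# States missing from `paths` would be unreachable, a simple model-level defect.
unreachable = set(ORDER_MODEL) - set(paths)

if __name__ == "__main__":
    for state, events in paths.items():
        print(state, events)
```

The same traversal doubles as a basic reachability check: an empty `unreachable` set means every state in the model can actually occur.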

10. SDLC

The Software Development Life Cycle (SDLC) is a software engineering process for planning, designing, developing, testing, and deploying software systems. SDLC is an important research topic in software engineering because software developers and project managers use it to manage software projects and ensure the quality of the resulting software products. The development and evaluation of novel software development processes is one SDLC-related research topic. SDLC research also includes the creation and evaluation of software project management tools and practices.

Researchers may also examine how SDLC is applied in specific sectors or applications. They may, for example, investigate the use of SDLC in the development of safety-critical systems, such as medical equipment or aviation systems, and develop new processes or tools to ensure the safety and reliability of these systems. They may also look into using SDLC to design software systems in newer areas such as the Internet of Things or blockchain technology.

Why is Software Engineering Required?

Software engineering is necessary because it provides a systematic approach to designing, developing, and maintaining reliable, efficient, and scalable software. As software systems have become more complex over time, software engineering has become a vital discipline for ensuring that software meets end-user needs, is reliable, and remains maintainable in the long term.

1. Cost

When the cost of software development is considered, software engineering becomes even more important. Without a disciplined strategy, developing software can result in inflated costs, delays, and a higher probability of errors that require costly fixes later. Furthermore, software engineering can help reduce long-term maintenance costs by ensuring that software is designed to be easy to maintain and modify. This saves money in the long run by lowering the resources and time needed to change the software.

2. Scalability

Scalability is an essential factor in software development, especially for programs that have to manage enormous amounts of data or an increasing number of users. Software engineering provides a foundation for creating scalable software that can evolve over time. The capacity to deploy software to diverse contexts, such as cloud-based platforms or distributed systems, is another facet of scalability. Software engineering can assist in ensuring that software is built to be readily deployed and adjusted for various environments, resulting in increased flexibility and scalability.

3. Large Software

Developers can use software engineering concepts to break down huge software systems into smaller, simpler parts. This reduces the software's complexity and makes the system easier to maintain over time. Furthermore, software engineering can aid in the development of large software systems in a modular fashion, with each module performing a specific function or set of functions. This makes it easier to add new features or functionality to the product without disrupting the existing codebase.

4. Dynamic Nature

Developers can use software engineering techniques to create dynamic content that is modular and easy to modify when user requirements change. This makes it easier to add new features or functionality to dynamic content without disturbing the existing codebase. Another factor to consider for dynamic content is security. Software engineering can help ensure that dynamic content is generated in a secure manner that protects user data and information.

5. Better Quality Management

Software engineering provides an organized method of quality management in software development. By adhering to software engineering principles, developers can ensure that software is designed, produced, and maintained in a way that fulfills quality requirements and provides value to users. Requirement management is one component of quality management in software engineering. Testing and validation are another. By using an organized approach to testing, developers can verify that their software satisfies its requirements and reduce the number of errors it contains.

In conclusion, software engineering offers a diverse set of research topics with the potential to advance the discipline while improving software development and maintenance practices. This article has covered research topics suitable for master's students and other software engineering students, such as software testing and validation, software security, artificial intelligence, natural language processing, software project management, machine learning, and data mining. Software engineering researchers have an interesting opportunity to explore these and other subjects and contribute to solutions that improve software quality, dependability, security, and scalability.

By staying up to date with the latest research trends and technologies, researchers can make important contributions to software engineering and help tackle some of the most serious difficulties in software development and maintenance. As software grows more important in business and daily life, there is a growing demand for research into new software engineering processes and techniques. Through their research, software engineering researchers can help shape the future of software creation and maintenance, ensuring that software stays dependable, safe, reliable, and efficient in an ever-changing technological context.

Frequently Asked Questions (FAQs)

To find a research topic in software engineering, you can review recent papers and conference proceedings, talk to experts in the field, and evaluate your own interests and experience. You can also combine these approaches.

As a software engineering student, you should study software development processes, programming languages and their frameworks, software testing and quality assurance, software architecture, commonly used design patterns, and software project management.

Empirical research, experimental research, surveys, case studies, and literature reviews are all types of research in software engineering. Each sort of study has advantages and disadvantages, and the research method chosen is determined by the research objective, resources, and available data. 

Profile

Eshaan Pandey

Eshaan is a Full Stack web developer skilled in MERN stack. He is a quick learner and has the ability to adapt quickly with respect to projects and technologies assigned to him. He has also worked previously on UI/UX web projects and delivered successfully. Eshaan has worked as an SDE Intern at Frazor for a span of 2 months. He has also worked as a Technical Blog Writer at KnowledgeHut upGrad writing articles on various technical topics.

Software Engineering’s Top Topics, Trends, and Researchers


best research topics software engineering


150 Best Research Paper Topics For Software Engineering

Software engineering is a branch of engineering that deals with the creation and improvement of software applications using specific methodologies and clearly defined scientific principles. Developing a software product requires following certain procedures, and the outcome is an efficient and reliable software product. Software is a collection of executable program code with associated libraries. Software designed to meet specific requirements is referred to as a software product. Software engineering is an excellent subject for a master's thesis, research, or project, and it offers a variety of topics that M.Tech and other master's students can use for their software thesis.

Why is Software Engineering Required?

Software engineering is necessary because user requirements and the environment in which software runs change frequently. Through your research and thesis, you will learn more about the significance of software engineering. Here are some other reasons software engineering is needed:

  • Big Software: The sheer size of modern software makes software engineering practices necessary.
  • Scalability: Software engineering makes it possible to scale up existing software rather than develop brand-new software.
  • Cost: Software engineering also cuts down the cost incurred during software development.
  • Dynamic Nature: Software engineering is crucial when new features must be added to existing software because its requirements keep changing.
  • Better Quality Management: Software engineering provides more efficient software development processes that deliver higher-quality services.

Best Research Paper Topics on Software

  • Software Engineering Management Unified Software Development Process and Extreme ProgrammingThere are a lot of difficulties with managing the development of software for web-based applications and projects for systems integration that were completed in recent times.
  • The Blue Sky Software Consulting Company Analysis
  • Blue Sky Software Consulting Blue Sky Software Consulting company has seen great success over 15 years. The company is not as well-equipped for the current market.
  • LabVIEW Software: Design Systems of Measurement
  • LabVIEW is a software program that was created to design systems for measurement. LabVIEW gives you a range of instruments to control the process in an experiment.
  • Software-producing Firm Reducing Inventory
  • The link between the reduction in inventory levels and the number of orders is evident. An organization that produces software may think of increasing the amount of software to a lower level.
  • Moet Hennessy - Louis Vuitton: Enterprise Software
  • The report will demonstrate how the introduction of ERP will help LVHM Group improve its results by improving its inventories, logistics and accounting.
  • Virtualization and Software-Defined Networking
  • The goal of this paper is to analyze the developments in the field of virtualization, software-defined networks and security for networks in the last three years.
  • Computer Hardware and Software Components
  • Computers developed in the 1940s have evolved into complex machines that require both software and hardware to operate.
  • Applications, Software and System Development
  • The use of Microsoft Office applications greatly enhances productivity in the classroom, at work and during everyday activities at home.
  • PeopleSoft Inc.'s Software Architecture and Design
  • With the PIA architecture, any company with an ERP application can access all of its operations through a Web browser.
  • Co-operative Banking Group's Enterprise Software
  • The report demonstrates how the implementation of the ERP system within the Co-operative Banking Group will help improve the company's accounting and inventory practices as well as its logistics processes.
  • Software Testing: Manual and Automated Web-Application Testing Tools
  • This research is an empirical study of automated and manual web-based application testing tools to determine the best tool for testing software.
  • JDA Software Company's Services
  • JDA Software is a company that has proven its worth in the development of services in areas like manufacturing, wholesale distribution, retailing and travel.
  • Data Management, Networking and Enterprise Software
  • Enterprise software is typically developed "in-house" and thus has an inflated cost when contrasted to purchasing the software from another firm.
  • Software Workshops and Seminars Reflections
  • Most seminars inspire participants to use their potential as they strive to attain their goals.
  • The Various Enterprise Resource Planning Software Packages
  • This paper's purpose is to provide an overview of the various Enterprise Resource Planning (ERP) software applications that are widely employed by companies to manage their business operations.
  • Explore Factors in IBM SPSS Statistical Software
  • The "Explore" command in IBM SPSS generates an output with a variety of statistics for a single variable, across the entire sample or in sections of the sample.
  • Split Variables in IBM SPSS Statistical Software
  • IBM SPSS software provides an option to split files into groups. The membership of cases in each group is determined by the values of the split variables.
  • Syntax Code Writing in Statistical Software
  • Analyzing quantitative data with the IBM SPSS software package often involves performing a variety of operations to calculate statistics for the data.
  • Data Coding in Statistical Software
  • Data coding is of utmost importance when a proper analysis of this data has to be conducted. Data coding plays an important function when you need to make use of statistical software.
  • Software Piracy at Kaspersky Cybersecurity Company
  • Software piracy is a pressing current issue that is manifested both locally with respect to an individual company and also globally.
  • Hotjar: Web Analytics Software Difference
  • This report examines Hotjar, a web-based analytics tool that comes with a full set of evaluation tools. It examines the tool's strengths and advantages and shows how it can aid management decision-making.
  • Avast Software: Company Analysis
  • Avast Software is a globally well-known multinational company that is an industry leader in providing security solutions for both business and individual customers.
  • Project Failure, Project Planning Fundamentals, and Software Tools and Techniques for Alternative Scheduling
  • From lack of communication to generally unfavourable working conditions, projects may fail when managers fail to prepare for their implementation.
  • Computer Elements such as Hardware and Software
  • Personal computers are usually different from computers used for business in terms of capabilities and the extent of technology used within the equipment.
  • Review of a New Framework for Software Reliability Measurement
  • This study draws on an in-depth review of software reliability measurement methods and proposes a new framework for reliability measurement built on the software metrics studied in the work of Amar and Rabai.

Good Software Research Topics & Essay Examples

  • Task Management Software in Organization
  • The goal of the plan for managing projects is to present the process of creating task management software that can be integrated into the context of the company.
  • A task management software plan's risk management strategy
  • The present study introduces us to the techniques for risk identification as well as quality assurance and a control plan and explains their significance.
  • Computer Software Development and Reality Shows
  • The growth of software in computers has been at such a fast rate over the last 10 years that it has impacted all aspects of our lives and every fibre of our being.
  • Scrum - Software Development Process
  • Digital systems and computerized systems have brought life to many areas. Scrum is a process for software development that guarantees high quality and efficiency.
  • Distribution of Anti-Virus Software
  • Numerous new threats are reported every fortnight. Cyberattacks, viruses, and other cyber-related threats are becoming an issue.
  • Marketing Plan: Innovative Type of Software Product
  • This paper will create an advertisement plan for the new kind of software, which will help to define the segment of clients and the price and communications platform.
  • Marketing System of Sakhr Software Co
  • The principal objective of this paper is to examine the marketing process in the same type of organization, like Sakhr Software Co.
  • Managing Information of Sakhr Software Co
  • This paper will examine the ideas of managing information for Sakhr Software, which is a well-known language software firm.
  • CRM Software in Amazon: Gains
  • The customer management software that Amazon.com developed has been, from the beginning, among the latest technologies.
  • Neurofeedback Software and Technology Comparison
  • MIDI technology makes creating, learning or playing music more enjoyable. Mobile phones, music keyboards, computers and other devices use MIDI.
  • PeopleSoft Software and HR.net Enterprise Software
  • With the help of HRIS software, HR employees are able to manage their own benefits updates and make changes, allowing them to take more time to focus on other important tasks.
  • Business Applications: Revelation HelpDesk by Yellow Fish Software
  • "Revelation HelpDesk" is an online Tracking and Support Software that facilitates seamless coordination to occur between the most important divisions within an organization.
  • 3D signal editing methods and editing software for stereoscopic movies
  • 3D editing for movies is one of the newest trends and is among the most complex processes in the modern film industry.
  • ERP Software in Inventory Management
  • ERP applications for inventory management are useful when a business has to manage how it receives goods and clears out merchandise.
  • The Capabilities of Compiere Software and How Well It Fits Into Different Industries
  • The ERP software Compiere can be used by a wide variety of users, including governments, businesses and non-governmental organizations (NGOs).
  • Software Tools for Qualitative Research
  • This paper reviews software tools to solve complicated tasks in the analysis of data. The paper compares NVivo, HyperRESEARCH, and Dedoose.
  • Data Scientist and Software Development
  • Data scientists convert data into insights, giving elaborate guidance to those who use the data to make educated decisions and take action.
  • IPR Violations in Software Development
  • Copyright law protects only the expression of software, not the underlying concept. It prohibits copying source code without permission.
  • Health IT: Epic Software Analysis
  • Implementation and adoption of Health IT systems are crucial to improve the efficiency of medical practices, efficiency of workflow as well as patient outcomes.
  • Agile Software Development Process
  • The agile process for software development offers numerous benefits, such as the speedy and continuous execution of your project.
  • Project Management Software and Tools Comparison
  • The software is used by managers to ensure that there isn't any worker who is receiving more work than others and also to ensure that no worker is falling behind in their job.
  • Visually impaired people: challenges in Assistive Technology Software
  • Blind people face a number of disadvantages each day while using digital technology. The various types of software discussed in this paper have been specifically designed to help improve the lives of blind people.
  • WBS completion and software project management
  • The PERT's results resulted in the development of The Gantt chart. This essay provides an account of the method of working with the Gantt chart.
  • International Software Development's Ethical Challenges: User-Useful Software
  • Ethics is important in software development. It helps the creator build software that will be useful for the user as well as for management.
  • Achieving the Optimal Process. Software Development
  • The industry of software development is growing rapidly as the requirements of users change. This requires applications to meet these needs.

Innovative Software to Blog About

  • System Software: Analysis of Various Types of System Software
  • The paper offers opinions on the various types of system software, covering their strengths and weaknesses based on the author's personal experience.
  • Sakhr Software Co.'s Marketing System
  • The principal goal of this paper is to study the uniqueness of the system of marketing in such an organization as Sakhr Software Co from Kuwait, which specializes in NLP.
  • Program Code in Assembly Language Using Easy68K Software
  • A typical scenario is described in the report to write program code in assembly language with Easy68K software. The appropriate tests were carried out with success and outputs.
  • Benefits and Drawbacks of Agile Software Development Techniques
  • The use of agile methodologies in the software development process contributes to the improvement of work as well as the effectiveness of performance.
  • Large Scale Software Development
  • This report gives information on this Resource Scheduling project. It can be useful to an advisory firm that offers various types of resources.
  • Penguin Sleuth, a Forensic Software Tool
  • The primary goal of this paper is to examine the various tools for forensic analysis and also provide a comprehensive overview of the functions available for each tool or tool pack.
  • System Software: Computer System Management
  • Computer software comprises precise preprogrammed instructions that regulate and coordinate hardware components of the computer.
  • Ethical Issues Involved in Software Project Management
  • Ethics within IT have been proven to be very different from other areas of ethics. Ethics issues in IT are usually described as having little.
  • Advantages and Disadvantages of Software Suites
  • Computer software comprises specific preprogrammed commands that control and coordinate computer hardware components of an info system.
  • Descriptive Statistics Using SPSS Software Suite
  • This paper focuses on the process of producing a descriptive statistical analysis with SPSS.
  • Software Development: Creating a Prototype
  • The aim of this article is to develop an experimental software program that can be utilized to aid breast cancer patients.
  • Software Engineering and Methodologies
  • The paper explains how the author learned the software engineering process and methods as an outcome of his experiences at BTR IT Consulting Company.
  • Information System Hardware and Software
  • Information technology covers a wide variety of applications in which computer software, along with hardware, is employed.
  • Software Development Project Using Agile Methods
  • The report will provide reasons behind why the agile methodology was chosen, the method used, how the team applied this methodology, and also the lessons learned from the massive project of software development.
  • Flight Planning Software and Aircraft Incidents
  • Software for flight planning refers to programs utilized to control and manage flights and other procedures while the plane is in flight.
  • Hardware and Software Systems and Criminal Justice
  • One of the primary techniques used to decrease the chance of criminal activity is crime mapping. This involves collecting information on crimes and their causes and then analyzing it in order to identify issues.
  • Why Open-Source Software Will (Or Will Not) Soon Dominate the Field of Database Management Tools
  • The research aims to determine whether open-source software will rule the field of the database since there is an evolution in the market for business.
  • Business HRM Software and the Affordable Care Act
  • The Affordable Care Act has its strengths but also flaws. The reason is the complex nature of the law that creates a variety of challenges.
  • Antivirus Software Ensuring Security Online
  • Although antivirus software is imperfect and offers only partial coverage, and should be seen as a supplement rather than the sole instrument, it helps protect one's privacy online.
  • Evaluating Teaching Instructional Software for 21st-Century Technology Resources
  • The software for teaching Joe Rock and Friends Book 2 is designed for third-grade students who are studying English as an additional language to read and learn new vocabulary.
  • Britam Insurance Company's Sales and Marketing Management Software
  • Britam Insurance Company needs to implement the latest marketing and management software in order to keep its place at the forefront of the extremely competitive insurance market.
  • Software Programs: Adobe Illustrator
  • With Adobe Illustrator, users can quickly and precisely create various products, like logos, icons as well as drawings.
  • Strawberry Business: Software Project Management
  • Although the company has an established management strategy as well as a team of employees and efficient information systems, it lacks a standardized workplace culture and customer relations systems.
  • Value of Salesforce Software Using VRIO Model
  • Salesforce CRM software is created to help managers manage their businesses effectively. It connects all teams and managers and collects and manages customer information.
  • Agile software development, as well as popular variations like Scrum, are the foundation for the work of a variety of testers and developers. No matter what team or method you're currently using, you can get expert guidance on process structure and the skills required to use Lean, Agile, DevOps, Waterfall and more to help you implement it for your business.

Most Interesting Software Research Titles

  • What Are the Essential Attributes of Good Software?
  • How Computer Software Can Be Used as a Tool for Education
  • Accounting Software and Application Software
  • Online National Polling Software Requirements Specification
  • Building Their Software for a Company's Success
  • The Role of Antivirus Software in Protecting Your Computer Data
  • Intellectual Property Rights, Innovation and Software Technologies
  • Software Piracy and the Canadian Piracy Act
  • For the development of software projects, agile methodologies and their Waterscrumfall derivative are used.
  • Software Tools for Improving Underground Mine Access Layouts
  • How Software Can Support Academic Librarians' Changing Role
  • Using the Untangle Software to Overcome Obstacles for Small Businesses
  • By employing travel portal software, online booking sales will increase.
  • Analysis of Network Externality and Commercial Software Piracy
  • Accounting Software and Business Solutions
  • Analysis of Key Issues and Effects Relating to International Software Piracy
  • The Distinction Between Computer Science and Software Engineering
  • Modulation: Computer Software and Unknown Music Virus
  • Math Software for High School Students with Disabilities
  • Keyboarding Software Packages: Analysis and Purchase Recommended
  • Basic Software Development Life Cycle
  • India's Problems with Software Patents, Copyright, and Piracy
  • Why Has India Been Able to Build a Thriving Software Industry
  • Does Social Software Increase Labour Productivity
  • The Role of Open Source Software for Database Servers

Simple Software Essay Ideas

  • Human Capital and the Indian Software Industry
  • Input-Output Computer Windows Software
  • Business Software Development and Its Implementation
  • Evaluating Financial Management Software: Quicken Software
  • Which governance tools are important in Africa for combating software piracy?
  • Distinguish Between Proprietary Software and Off-The-Shelf
  • Does Social Software Support Service Innovation
  • Ambulatory Revenue Management Software
  • Difference Between Operating Systems and Application Software
  • Leading a Global Insurgency in the Software Sector are China and India
  • Call Accounting Software for Every Enterprise
  • Technology Standards for Software Outsourcing
  • The Importance of the Agile Approach for Software Development
  • Application Software: Publisher, Word, and Excel
  • Employee Monitoring Through Computer Software
  • Software Development Lifecycle and Testing's Importance
  • Tools for Global Conditional Policy to Combat Software Piracy
  • Software for Designing Solar Water Heating Systems
  • Open Source Software, Competition, and Potential Entry
  • Indian Software Industry: Gains are distorted and consolidated
  • Software Programs for Disabled Computer Users and Assistive Technology
  • Agile Software Architecture, Written by Christine Miyachi
  • Software Development: The Disadvantages of Agile Methods
  • Computer Software Technology for Early Childhood
  • Developing Test Automation Software Development

Easy Software Essay Topics

  • Growth Trends, Barriers, and Government Initiatives in the Indian Software Industry
  • How Does Enterprise Software Enable a Business to Use
  • Integrated Management Software the Processing of Information
  • Computer Software Training for Doctor's Office
  • Software Intellectual Property Rights and Venture Capitalist Access
  • Computer Science Software Specification
  • Software Projects and Student Software Risk Exposure
  • Why It Is Difficult to Create Software for Wireless Devices
  • Affiliate Tracking Software Your Payment Options
  • How Can Volkswagen Recover From the Cheating Issues It Had Because Illegal Software Was Installed?
  • Principles of Best Forensic Software Tool
  • The American Software Industry: A Historical Analysis
  • How Peripheral Developers Contribute to the Development of Open-Source Software
  • Agile Methodologies for Software Development
  • Key Macroeconomic Factors That Affect Software Industry
  • The Software Industry and India's Economic Development
  • Improving Customer Service Through Help Desk Software
  • Enterprise Resource Planning and Sap Software
  • Antivirus Software and Its Importance
  • Hardware and Software Used in Public Bank
  • The Effects of Computer Software Piracy on the Global Economy
  • Using the Winqsb Software in Critical Path Analysis
  • General Information About Interactive Multimedia-Based Educational Software
  • How Affiliate Tracking Software Can Benefit You
  • Computer Software and Recent Technologies

Frequently asked questions

What are the main topics of software engineering?

The main topics of software development include:

  • Introduction
  • Models and architecture for software development
  • Project management for software (SPM)
  • Software prerequisites
  • Testing and debugging software

What makes good research in software engineering?

The most typical research strategy in software engineering is coming up with a novel method or methodology, then validating it through analysis or demonstrating its application through a case study.

What projects are good for software engineering?

  • Monitoring of Android tasks
  • Analyzing attitudes to rate products
  • ATM with a fingerprint-based method
  • A modern system for managing employees
  • Using the AES technique for image encryption
  • Vote-by-fingerprint technology
  • System for predicting the weather

What are the research methods in software engineering?

We list and contrast the five categories of research methodology that, in our opinion, are most pertinent to software engineering: controlled experiments (including quasi-experiments); case studies (both exploratory and confirmatory); survey research; ethnographies; and action research.

Is software engineering a research area?

A relatively recent area of research, software engineering is derived from computer science. Its significance has been generally acknowledged by more and more academics in the field of computers throughout the course of six decades, from 1948 to the present, and it has developed into a vibrant and promising division of the computing profession.

Is software engineering easy?

Learning software engineering can be challenging at first, especially for those without programming or coding experience or any background in technology. However, numerous courses, tools, and other resources are available to assist with learning how to become a software engineer.

Who is the father of software engineering?

Watts S. Humphrey, known as the "father of software quality," was an American software engineering pioneer. He was born in Battle Creek, Michigan (U.S.), on July 4, 1927, and died on October 28, 2010.

What do you do in software engineering?

  • roles and tasks for software engineers
  • creating and keeping up software systems.
  • testing and evaluating new software applications.
  • software speed and scalability optimization.
  • code creation and testing.
  • consulting with stakeholders such as clients, engineers, security experts, and others.

Which is better, IT or software engineering?

IT support engineers cannot build sophisticated solutions, while software engineers can. In short, software engineers are in charge of creating and deploying software. Knowing the distinctions makes it easier to choose the right individual to handle our tech-related problems.

Are junior software engineers in demand?

Yes, there is a need for young coders.

Is software engineering going down?

Software experts and software goods are oversaturating the job market for software engineers.

What degree do I need to be a software engineer?

An undergraduate degree.

Can I be a software engineer without a degree?

Yes. Many software developers lack a degree from a reputable university (or, in some cases, have no degree at all).

How many years can a software engineer work?

An engineer who wants to work in IT has a 15–20 year window.

How many hours do software engineers work?

Software developers put in 8 to 9 hours each day, or 40 to 45 hours per week.


Software Engineering Institute

Cite this post.

AMS Citation

Carleton, A., 2021: Architecting the Future of Software Engineering: A Research and Development Roadmap. Carnegie Mellon University, Software Engineering Institute's Insights (blog), Accessed June 5, 2024, https://insights.sei.cmu.edu/blog/architecting-the-future-of-software-engineering-a-research-and-development-roadmap/.


Architecting the Future of Software Engineering: A Research and Development Roadmap


Anita Carleton

Published July 12, 2021, in Software Engineering Research and Development


This post is coauthored by John Robert, Mark Klein, Doug Schmidt, Forrest Shull, John Foreman, Ipek Ozkaya, Robert Cunningham, Charlie Holland, Erin Harper, and Edward Desautels.

Software is vital to our country’s global competitiveness, innovation, and national security. It also ensures our modern standard of living and enables continued advances in defense, infrastructure, healthcare, commerce, education, and entertainment. As the DoD’s federally funded research and development center (FFRDC) focused on improving the practice of software engineering, the Carnegie Mellon University (CMU) Software Engineering Institute (SEI) is leading the community in creating a multi-year research and development vision and roadmap for engineering next-generation software-reliant systems. This blog post describes that effort.

Software Engineering as Strategic Advantage

In a 2020 National Academy of Science Study on Air Force software sustainment, the U.S. Air Force recognized that “to continue to be a world-class fighting force, it needs to be a world-class software developer.” This concept clearly applies far beyond the Department of Defense. Software systems enable world-class healthcare, commerce, education, energy generation, and more. These systems that run our world are rapidly becoming more data intensive and interconnected, increasingly utilize AI, require larger-scale integration, and must be considerably more resilient. Consequently, significant investment in software engineering R&D is needed now to enable and ensure future capability.

Goals of This Work

The SEI has leveraged its connections with academic institutions and communities, DoD leaders and members of the Defense Industrial Base, and industry innovators and research organizations to:

  • identify future challenges in engineering software-reliant and intelligent systems in emerging, national-priority technical domains, including gaps between current engineering techniques and future domains that will be more reliant on continuous evolution and AI
  • develop a research roadmap that will drive advances in foundational software engineering principles across a range of system types, such as intelligent, safety-critical, and data-intensive systems
  • raise the visibility of software to the point where it receives the sustained recognition commensurate with its importance to national security and competitiveness
  • enable strategic partnerships and collaborations to drive innovation among industry, academia, and government.

Guided by an Advisory Board of U.S. Visionaries and Senior Thought Leaders

To succeed in developing our vision and roadmap for software engineering research and development, it is vital to coordinate the academic, defense, and commercial communities to define an effective agenda and implement impactful results. To help represent the views of all these software engineering constituencies, the SEI formed an advisory board from DoD, industry, academia, research labs, and technology companies to offer guidance. Members of this advisory board include the following:

  • Deb Frincke, advisory board chair, Associate Laboratory Director for National Security Sciences, Oak Ridge National Laboratory
  • Michael McQuade, vice president for research, Carnegie Mellon University
  • Vint Cerf, vice president and chief internet evangelist, Google
  • Penny Compton, vice president for software systems, cyber, and operations, Lockheed Martin Space
  • Tim Dare, deputy director for prototyping and software, Office of the Under Secretary of Defense for Research and Engineering (previous position)
  • Sara Manning Dawson, chief technology officer, enterprise security, Microsoft
  • Jeff Dexter, senior director of flight software & cybersecurity, SpaceX
  • Yolanda Gil, president, Association for the Advancement of Artificial Intelligence (AAAI); Director of Knowledge Technologies, Information Sciences Institute at University of Southern California
  • Tim McBride, president, Zoic Studios
  • Nancy Pendleton, vice president and senior chief engineer for mission systems, payloads and sensors, Boeing Defense, Space and Security
  • William Scherlis, director, Information Innovation Office, DARPA

In June 2020, the SEI assembled this board to leverage their diverse perspectives and provide strategic advice, influence stakeholders, develop connections, assist in executing the roadmap, and advocate for the use of our results.

Future Systems and Fundamental Shifts in Software Engineering Require New Research Focus

Rapidly deploying software with confidence requires fundamental shifts in software engineering. New types of systems will continue to push beyond the bounds of what current software engineering theories, tools, and practices can support, including (but not limited to):

  • Systems that fuse data at a huge scale, whether for news, entertainment, or intelligence: We will need to continuously mine vast amounts of open-source data streams (e.g., YouTube videos and Twitter feeds) for important information that will in turn drive decision making. This vast stream of data will also drive new ways of constructing systems.
  • Smart cities, buildings, roads, cars, and transport: How will these highly connected systems work together seamlessly? How will we enable safe and affordable transportation and living?
  • Personal digital assistants: How will these assistants learn, adapt, and engage in home and business workflows?
  • Dynamically integrated healthcare: Data from your personal device will be combined with hospital data. How do we meet stringent safety and privacy requirements? How do we evaluate assurance in a highly data-driven environment?
  • Mission-level adaptation for DoD systems: DoD systems will feature mission-level construction of new integrated systems that combine a range of capabilities, such as intel, weapons, and human/machine teaming. The DoD is already moving in this direction, but how can we increase confidence that there will be no unintended consequences?

A Guiding Vision of the Future of Software Engineering

Our guiding vision is one in which the current notion of software development is replaced by the concept of a software pipeline consisting of humans and software as trustworthy collaborators who rapidly evolve systems based on user intent. To achieve this vision, we anticipate the need for not only new development paradigms but also new architectural paradigms for engineering new kinds of systems.

Advanced development paradigms, such as those listed below, lead to efficiency and trust at scale:

  • Humans leverage trusted AI as a workforce multiplier for all aspects of software creation.
  • Formal assurance arguments are evolved to assure and efficiently re-assure continuously evolving software.
  • Advanced software composition mechanisms enable predictable construction of systems at increasingly large scale.

Advanced architectural paradigms, as outlined below, enable the predictable use of new computational models:

  • Theories and techniques drawn from the behavioral sciences are used to design large-scale socio-technical systems, leading to predictable social outcomes.
  • New analysis and design methods facilitate the development of quantum-enabled systems.
  • AI and non-AI components interact in predictable ways to achieve enhanced mission, societal, and business goals.

Research Focus Areas

The fundamental shifts and needed advances in software engineering described above require new areas of research. In close collaboration with our advisory board and other leaders in the software engineering community, we have developed a research roadmap with six focus areas. Figure 1 shows those areas and outlines a suggested course of research topics to undertake. Short descriptions of each focus area and its challenges follow.

Figure 1: Software Engineering Research Roadmap with Research Focus Areas and Research Objectives (10-15 Year Horizon)

  • AI-Augmented Software Development. At almost every stage of the software development process, AI holds the promise of assisting humans. Relieved of tedious tasks, humans will be better able to focus on tasks that require the creativity and innovation that only they can provide. To reach this goal, we need to re-envision the entire software development process with increased AI and automation tool support for developers, and we need to ensure we take advantage of the data generated throughout the entire lifecycle. The focus of this research area is on what AI-augmented software development will look like at each stage of the development process and during continuous evolution, where it will be particularly useful in taking on routine tasks.
  • Assuring Continuously Evolving Systems. When we consider the software-reliant systems of today, we see that they are not static (or even infrequently updated) engineering artifacts. Instead, they are fluid, meaning that they are expected to undergo continuing updates and improvements throughout their lifespan. The goal of this research area is therefore to develop a theory and practice of rapid and assured software evolution that enables efficient and bounded re-assurance of continuously evolving systems.
  • Software Construction through Compositional Correctness. As the scope and scale of software-reliant systems continue to grow and change, the complexity of these systems makes it unrealistic for any one person or group to understand the entire system. It is therefore necessary to integrate (and continually re-integrate) software-reliant systems using technologies and platforms that support the composition of modular components, many of which are reused from existing elements that were not designed to be integrated or evolved together. The goal of this research area is to create methods and tools (such as domain-specific modeling languages and annotation-based dependency injection) that enable the specification and enforcement of composition rules that allow (1) the creation of required behaviors (both functionality and quality attributes) and (2) the assurance of these behaviors.
  • Engineering Socio-Technical Systems. Societal-scale software systems, such as today’s commercial social media systems, are designed to keep users engaged to influence them. However, avoiding bias and ensuring the accuracy of information are not always goals or outcomes of these systems. Engineering societal-scale systems focuses on prediction of such outcomes (which we refer to as socially inspired quality attributes) that arise when humans are integral components of the system. The goal is to leverage insights from the social sciences to build and evolve societal-scale software systems that consider qualities such as bias and influence.
  • Engineering AI-enabled Software Systems. AI-enabled systems, which are software-reliant systems that include AI and non-AI components, have some inherently different characteristics than those without AI. However, AI-enabled systems are, above all, a type of software system. These systems have many parallels with the development and sustainment of more conventional software-reliant systems. This research area focuses on exploring which existing software engineering practices can reliably support the development of AI systems, as well as identifying and augmenting software engineering techniques for the specification, design, architecture, analysis, deployment, and sustainment of systems with AI components.
  • Engineering Quantum Computing Systems. Advances in software engineering for quantum computing are as important as the hardware advances. The goals of this research area are to first enable current quantum computers so they can be programmed more easily and reliably, and then enable increasing abstraction as larger, fully fault-tolerant quantum computing systems become available. Eventually, it should be possible to fully integrate these types of systems into a unified classical and quantum software development lifecycle.
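
The "specification and enforcement of composition rules" described under Software Construction through Compositional Correctness can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the `Component` class, the `compose` function, and the two rules it enforces are hypothetical and do not come from the roadmap. It is a simplified stand-in for the kind of check a domain-specific modeling language or a dependency-injection framework might perform.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """Hypothetical component description: interfaces it provides and requires."""
    name: str
    provides: set[str] = field(default_factory=set)
    requires: set[str] = field(default_factory=set)


class CompositionError(Exception):
    """Raised when an (assumed) composition rule is violated."""


def compose(components: list[Component]) -> dict[str, str]:
    """Wire every required interface to the component that provides it.

    Illustrative rules (assumptions, not taken from the roadmap):
      1. no interface may have more than one provider;
      2. every required interface must have a provider.
    """
    providers: dict[str, str] = {}
    for comp in components:
        for iface in sorted(comp.provides):
            if iface in providers:
                raise CompositionError(
                    f"{iface!r} is provided by both "
                    f"{providers[iface]!r} and {comp.name!r}"
                )
            providers[iface] = comp.name

    wiring: dict[str, str] = {}
    for comp in components:
        for iface in sorted(comp.requires):
            if iface not in providers:
                raise CompositionError(
                    f"{comp.name!r} requires {iface!r}, which nothing provides"
                )
            wiring[f"{comp.name}.{iface}"] = providers[iface]
    return wiring


if __name__ == "__main__":
    system = [
        Component("planner", provides={"Route"}, requires={"Map"}),
        Component("map_service", provides={"Map"}),
    ]
    print(compose(system))
```

Because the rules are checked when components are assembled, a missing or duplicated provider is reported before the system runs. That covers only the structural part of assurance; behavioral and quality-attribute guarantees would need far richer specifications than this sketch attempts.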

Help Shape Our National Software Research Agenda

Along with the advisory board, our research team has examined future trends in the computing landscape and emerging technologies; conducted a series of expert interviews; and convened multiple workshops for broad engagement and diverse perspectives, including a workshop on Software Engineering Grand Challenges and Future Visions co-hosted with the Defense Advanced Research Projects Agency (DARPA). This workshop brought together leaders in the software engineering research and development community to describe (1) important classes of future software-reliant systems and their associated software engineering challenges, and (2) research methods, tools, and practices that are needed to make those systems feasible. An upcoming SEI blog post will provide a synopsis of what was covered in this workshop.

Your feedback would be appreciated on the software engineering challenges and proposed research focus areas to help inform the National Agenda for Software Engineering Study. Please email [email protected] to send your thoughts and comments on the software engineering study & research roadmap or to volunteer as a potential reviewer of study drafts. Thank you.

Author: Anita Carleton

Journal of Software Engineering Research and Development

Metric-centered and technology-independent architectural views for software comprehension

The maintenance of applications is a crucial activity in the software industry. The high cost of this process is due to the effort invested in software comprehension since, in most cases, there is no up-to-...

Back to the future: origins and directions of the “Agile Manifesto” – views of the originators

In 2001, seventeen professionals set up the manifesto for agile software development. They wanted to define values and basic principles for better software development. On top of being brought into focus, the ...

Investigating the effectiveness of peer code review in distributed software development based on objective and subjective data

Code review is a potential means of improving software quality. To be effective, it depends on different factors, and many have been investigated in the literature to identify the scenarios in which it adds qu...

On the benefits and challenges of using kanban in software engineering: a structured synthesis study

Kanban is increasingly being used in diverse software organizations. There is extensive research regarding its benefits and challenges in Software Engineering, reported in both primary and secondary studies. H...

Challenges on applying genetic improvement in JavaScript using a high-performance computer

Genetic Improvement is an area of Search Based Software Engineering that aims to apply evolutionary computing operators to the software source code to improve it according to one or more quality metrics. This ...

Actor’s social complexity: a proposal for managing the iStar model

Complex systems are inherent to modern society, in which individuals, organizations, and computational elements relate with each other to achieve a predefined purpose, which transcends individual goals. In thi...

Investigating measures for applying statistical process control in software organizations

The growing interest in improving software processes has led organizations to aim for high maturity, where statistical process control (SPC) is required. SPC makes it possible to analyze process behavior, pred...

An approach for applying Test-Driven Development (TDD) in the development of randomized algorithms

TDD is a technique traditionally applied in applications with deterministic algorithms, in which the input and the expected result are known. However, the application of TDD with randomized algorithms have bee...

Supporting governance of mobile application developers from mining and analyzing technical questions in stack overflow

There is a need to improve the direct communication between large organizations that maintain mobile platforms (e.g. Apple, Google, and Microsoft) and third-party developers to solve technical questions that e...

Working software over comprehensive documentation – Rationales of agile teams for artefacts usage

Agile software development (ASD) promotes working software over comprehensive documentation. Still, recent research has shown agile teams to use quite a number of artefacts. Whereas some artefacts may be adopt...

Development as a journey: factors supporting the adoption and use of software frameworks

From the point of view of the software framework owner, attracting new and supporting existing application developers is crucial for the long-term success of the framework. This mixed-methods study explores th...

Applying user-centered techniques to analyze and design a mobile application

Techniques that help in understanding and designing user needs are increasingly being used in Software Engineering to improve the acceptance of applications. Among these techniques we can cite personas, scenar...

A measurement model to analyze the effect of agile enterprise architecture on geographically distributed agile development

Efficient and effective communication (active communication) among stakeholders is thought to be central to agile development. However, in geographically distributed agile development (GDAD) environments, it c...

A survey of search-based refactoring for software maintenance

This survey reviews published materials related to the specific area of Search-Based Software Engineering that concerns software maintenance and, in particular, refactoring. The survey aims to give a comprehen...

Guest editorial foreword for the special issue on automated software testing: trends and evidence

Similarity testing for role-based access control systems

Access control systems demand rigorous verification and validation approaches, otherwise, they can end up with security breaches. Finite state machines based testing has been successfully applied to RBAC syste...

An algorithm for combinatorial interaction testing: definitions and rigorous evaluations

Combinatorial Interaction Testing (CIT) approaches have drawn the attention of the software testing community to generate sets of smaller, efficient, and effective test cases where they have been successful in det...

How diverse is your team? Investigating gender and nationality diversity in GitHub teams

Building an effective team of developers is a complex task faced by both software companies and open source communities. The problem of forming a “dream”

Investigating factors that affect the human perception on god class detection: an analysis based on a family of four controlled experiments

Evaluation of design problems in object oriented systems, which we call code smells, is mostly a human-based task. Several studies have investigated the impact of code smells in practice. Studies focusing on h...

On the evaluation of code smells and detection tools

Code smells refer to any symptom in the source code of a program that possibly indicates a deeper problem, hindering software maintenance and evolution. Detection of code smells is challenging for developers a...

On the influence of program constructs on bug localization effectiveness

Software projects often reach hundreds or thousands of files. Therefore, manually searching for code elements that should be changed to fix a failure is a difficult task. Static bug localization techniques pro...

DyeVC: an approach for monitoring and visualizing distributed repositories

Software development using distributed version control systems has become more frequent recently. Such systems bring more flexibility, but also greater complexity to manage and monitor multiple existing reposi...

A genetic algorithm based framework for software effort prediction

Several prediction models have been proposed in the literature using different techniques obtaining different results in different contexts. The need for accurate effort predictions for projects is one of the ...

Elaboration of software requirements documents by means of patterns instantiation

Studies show that problems associated with the requirements specifications are widely recognized for affecting software quality and impacting effectiveness of its development process. The reuse of knowledge ob...

ArchReco: a software tool to assist software design based on context aware recommendations of design patterns

This work describes the design, development and evaluation of a software Prototype, named ArchReco, an educational tool that employs two types of Context-aware Recommendations of Design Patterns, to support us...

On multi-language software development, cross-language links and accompanying tools: a survey of professional software developers

Non-trivial software systems are written using multiple (programming) languages, which are connected by cross-language links. The existence of such links may lead to various problems during software developmen...

SoftCoDeR approach: promoting Software Engineering Academia-Industry partnership using CMD, DSR and ESE

The Academia-Industry partnership has been increasingly encouraged in the software development field. The main focus of the initiatives is driven by the collaborative work where the scientific research work me...

Issues on developing interoperable cloud applications: definitions, concepts, approaches, requirements, characteristics and evaluation models

Among research opportunities in software engineering for the cloud computing model, interoperability stands out. We found that the dynamic nature of cloud technologies and the battle for market domination make clo...

Game development software engineering process life cycle: a systematic review

A software game is a kind of application that is used not only for entertainment, but also for serious purposes that can be applicable to different domains such as education, business, and health care. Multidisc...

Correlating automatic static analysis and mutation testing: towards incremental strategies

Traditionally, mutation testing is used as test set generation and/or test evaluation criteria once it is considered a good fault model. This paper uses mutation testing for evaluating an automated static anal...

A multi-objective test data generation approach for mutation testing of feature models

Mutation approaches have been recently applied for feature testing of Software Product Lines (SPLs). The idea is to select products, associated to mutation operators that describe possible faults in the Featur...

An extended global software engineering taxonomy

In Global Software Engineering (GSE), the need for a common terminology and knowledge classification has been identified to facilitate the sharing and combination of knowledge by GSE researchers and practition...

A systematic process for obtaining the behavior of context-sensitive systems

Context-sensitive systems use contextual information in order to adapt to the user’s current needs or requirements failure. Therefore, they need to dynamically adapt their behavior. It is of paramount importan...

Distinguishing extended finite state machine configurations using predicate abstraction

Extended Finite State Machines (EFSMs) provide a powerful model for the derivation of functional tests for software systems and protocols. Many EFSM based testing problems, such as mutation testing, fault diag...

Extending statecharts to model system interactions

Statecharts are diagrams comprised of visual elements that can improve the modeling of reactive system behaviors. They extend conventional state diagrams with the notions of hierarchy, concurrency and communic...

On the relationship of code-anomaly agglomerations and architectural problems

Several projects have been discontinued in the history of the software industry due to the presence of software architecture problems. The identification of such problems in source code is often required in re...

An approach based on feature models and quality criteria for adapting component-based systems

Feature modeling has been widely used in domain engineering for the development and configuration of software product lines. A feature model represents the set of possible products or configurations to apply i...

Patch rejection in Firefox: negative reviews, backouts, and issue reopening

Writing patches to fix bugs or implement new features is an important software development task, as it contributes to raising the quality of a software system. Not all patches are accepted in the first attempt, ...

Investigating probabilistic sampling approaches for large-scale surveys in software engineering

Establishing representative samples for Software Engineering surveys is still considered a challenge. Specialized literature often presents limitations on interpreting surveys’ results, mainly due to the use o...

Characterising the state of the practice in software testing through a TMMi-based process

The software testing phase, despite its importance, is usually compromised by the lack of planning and resources in industry. This can risk the quality of the derived products. The identification of mandatory ...

Self-adaptation by coordination-targeted reconfigurations

A software system is self-adaptive when it is able to dynamically and autonomously respond to changes detected either in its internal components or in its deployment environment. This response is expected to ensu...

Templates for textual use cases of software product lines: results from a systematic mapping study and a controlled experiment

Use case templates can be used to describe functional requirements of a Software Product Line. However, to the best of our knowledge, no efforts have been made to collect and summarize these existing templates...

F3T: a tool to support the F3 approach on the development and reuse of frameworks

Frameworks are used to enhance the quality of applications and the productivity of the development process, since applications may be designed and implemented by reusing framework classes. However, frameworks ...

NextBug: a Bugzilla extension for recommending similar bugs

Due to the characteristics of the maintenance process followed in open source systems, developers are usually overwhelmed with a great number of bugs. For instance, in 2012, approximately 7,600 bugs/month were...

Assessing the benefits of search-based approaches when designing self-adaptive systems: a controlled experiment

The well-orchestrated use of distilled experience, domain-specific knowledge, and well-informed trade-off decisions is imperative if we are to design effective architectures for complex software-intensive syst...

Revealing influence of model structure and test case profile on the prioritization of test cases in the context of model-based testing

Test case prioritization techniques aim at defining an order of test cases that favors the achievement of a goal during test execution, such as revealing failures as early as possible. A number of techniques ...

A metrics suite for JUnit test code: a multiple case study on open source software

The code of JUnit test cases is commonly used to characterize software testing effort. Different metrics have been proposed in the literature to measure various perspectives of the size of JUnit test cases. Unfort...

Designing fault-tolerant SOA based on design diversity

Over recent years, software developers have been evaluating the benefits of both Service-Oriented Architecture (SOA) and software fault tolerance techniques based on design diversity. This is achieved by creat...

Method-level code clone detection through LWH (Light Weight Hybrid) approach

Many researchers have investigated different techniques to automatically detect duplicate code in programs exceeding thousand lines of code. These techniques have limitations in finding either the structural o...

The problem of conceptualization in god class detection: agreement, strategies and decision drivers

The concept of code smells is widespread in Software Engineering. Despite the empirical studies addressing the topic, the set of context-dependent issues that impacts the human perception of what is a code sme...

Software Engineering

At Google, we pride ourselves on our ability to develop and launch new products and features at a very fast pace. This is made possible in part by our world-class engineers, but our approach to software development enables us to balance speed and quality, and is integral to our success. Our obsession with speed and scale is evident in our developer infrastructure and tools. Developers across the world continually write, build, test and release code in multiple programming languages like C++, Java, Python, JavaScript and others, and the Engineering Tools team, for example, is challenged to keep this development ecosystem running smoothly. Our engineers leverage these tools and infrastructure to produce clean code and keep software development running at an ever-increasing scale. In our publications, we share associated technical challenges and lessons learned along the way.

Software Engineering: Recently Published Documents

Identifying Non-Technical Skill Gaps in Software Engineering Education: What Experts Expect But Students Don’t Learn

As the importance of non-technical skills in the software engineering industry increases, the skill sets of graduates match less and less with industry expectations. A growing body of research exists that attempts to identify this skill gap. However, only a few studies so far explicitly compare opinions of the industry with what is currently being taught in academia. By aggregating data from three previous works, we identify the three biggest non-technical skill gaps between industry and academia for the field of software engineering: devoting oneself to continuous learning, being creative by approaching a problem from different angles, and thinking in a solution-oriented way by favoring outcome over ego. Eight follow-up interviews were conducted to further explore how the industry perceives these skill gaps, yielding 26 sub-themes grouped into six bigger themes: stimulating continuous learning, stimulating creativity, creative techniques, addressing the gap in education, skill requirements in industry, and the industry selection process. With this work, we hope to inspire educators to give the necessary attention to the uncovered skills, further mitigating the gap between the industry and the academic world.

Opportunities and Challenges in Code Search Tools

Code search is a core software engineering task. Effective code search tools can help developers substantially improve their software development efficiency and effectiveness. In recent years, many code search studies have leveraged different techniques, such as deep learning and information retrieval approaches, to retrieve expected code from a large-scale codebase. However, there is a lack of a comprehensive comparative summary of existing code search approaches. To understand the research trends in existing code search studies, we systematically reviewed 81 relevant studies. We investigated the publication trends of code search studies, analyzed key components, such as the codebase, query, and modeling technique used to build code search tools, and classified existing tools according to the seven different search tasks they focus on supporting. Based on our findings, we identified a set of outstanding challenges in existing studies and a research roadmap for future code search research.

Psychometrics in Behavioral Software Engineering: A Methodological Introduction with Guidelines

A meaningful and deep understanding of the human aspects of software engineering (SE) requires psychological constructs to be considered. Psychology theory can facilitate the systematic and sound development as well as the adoption of instruments (e.g., psychological tests, questionnaires) to assess these constructs. In particular, to ensure high quality, the psychometric properties of instruments need evaluation. In this article, we provide an introduction to psychometric theory for the evaluation of measurement instruments for SE researchers. We present guidelines that enable using existing instruments and developing new ones adequately. We conducted a comprehensive review of the psychology literature framed by the Standards for Educational and Psychological Testing. We detail activities used when operationalizing new psychological constructs, such as item pooling, item review, pilot testing, item analysis, factor analysis, statistical property of items, reliability, validity, and fairness in testing and test bias. We provide an openly available example of a psychometric evaluation based on our guideline. We hope to encourage a culture change in SE research towards the adoption of established methods from psychology. To improve the quality of behavioral research in SE, studies focusing on introducing, validating, and then using psychometric instruments need to be more common.

Towards an Anatomy of Software Craftsmanship

Context: The concept of software craftsmanship has early roots in computing, and in 2009, the Manifesto for Software Craftsmanship was formulated as a reaction to how the Agile methods were practiced and taught. But software craftsmanship has seldom been studied from a software engineering perspective. Objective: The objective of this article is to systematize an anatomy of software craftsmanship through literature studies and a longitudinal case study. Method: We performed a snowballing literature review based on an initial set of nine papers, resulting in 18 papers and 11 books. We also performed a case study following seven years of software development of a product for the financial market, eliciting qualitative and quantitative results. We used thematic coding to synthesize the results into categories. Results: The resulting anatomy is centered around four themes, containing 17 principles and 47 hierarchical practices connected to the principles. We present the identified practices based on the experiences gathered from the case study, triangulating with the literature results. Conclusion: We provide our systematically derived anatomy of software craftsmanship with the goal of inspiring more research into the principles and practices of software craftsmanship and how these relate to other principles within software engineering in general.

On the Reproducibility and Replicability of Deep Learning in Software Engineering

Context: Deep learning (DL) techniques have gained significant popularity among software engineering (SE) researchers in recent years. This is because they can often solve many SE challenges without enormous manual feature engineering effort and complex domain knowledge. Objective: Although many DL studies have reported substantial advantages over other state-of-the-art models on effectiveness, they often ignore two factors: (1) reproducibility —whether the reported experimental results can be obtained by other researchers using authors’ artifacts (i.e., source code and datasets) with the same experimental setup; and (2) replicability —whether the reported experimental result can be obtained by other researchers using their re-implemented artifacts with a different experimental setup. We observed that DL studies commonly overlook these two factors and declare them as minor threats or leave them for future work. This is mainly due to high model complexity with many manually set parameters and the time-consuming optimization process, unlike classical supervised machine learning (ML) methods (e.g., random forest). This study aims to investigate the urgency and importance of reproducibility and replicability for DL studies on SE tasks. Method: In this study, we conducted a literature review on 147 DL studies recently published in 20 SE venues and 20 AI (Artificial Intelligence) venues to investigate these issues. We also re-ran four representative DL models in SE to investigate important factors that may strongly affect the reproducibility and replicability of a study. Results: Our statistics show the urgency of investigating these two factors in SE, where only 10.2% of the studies investigate any research question to show that their models can address at least one issue of replicability and/or reproducibility. More than 62.6% of the studies do not even share high-quality source code or complete data to support the reproducibility of their complex models. 
Meanwhile, our experimental results show the importance of reproducibility and replicability: the reported performance of a DL model could not be reproduced because of an unstable optimization process. Replicability could be substantially compromised if the model training is not convergent, or if performance is sensitive to the size of the vocabulary and testing data. Conclusion: It is urgent for the SE community to provide a long-lasting link to a high-quality reproduction package, enhance DL-based solution stability and convergence, and avoid performance sensitivity on different sampled data.

Predictive Software Engineering: Transform Custom Software Development into Effective Business Solutions

The paper examines the principles of the Predictive Software Engineering (PSE) framework. The authors examine how PSE enables custom software development companies to offer transparent services and products while staying within the intended budget and a guaranteed budget. The paper will cover all 7 principles of PSE: (1) Meaningful Customer Care, (2) Transparent End-to-End Control, (3) Proven Productivity, (4) Efficient Distributed Teams, (5) Disciplined Agile Delivery Process, (6) Measurable Quality Management and Technical Debt Reduction, and (7) Sound Human Development.

Software—A New Open Access Journal on Software Engineering

Software (ISSN: 2674-113X) [...]

Improving bioinformatics software quality through incorporation of software engineering practices

Background Bioinformatics software is developed for collecting, analyzing, integrating, and interpreting life science datasets that are often enormous. Bioinformatics engineers often lack the software engineering skills necessary for developing robust, maintainable, reusable software. This study presents review and discussion of the findings and efforts made to improve the quality of bioinformatics software. Methodology A systematic review was conducted of related literature that identifies core software engineering concepts for improving bioinformatics software development: requirements gathering, documentation, testing, and integration. The findings are presented with the aim of illuminating trends within the research that could lead to viable solutions to the struggles faced by bioinformatics engineers when developing scientific software. Results The findings suggest that bioinformatics engineers could significantly benefit from the incorporation of software engineering principles into their development efforts. This leads to suggestion of both cultural changes within bioinformatics research communities as well as adoption of software engineering disciplines into the formal education of bioinformatics engineers. Open management of scientific bioinformatics development projects can result in improved software quality through collaboration amongst both bioinformatics engineers and software engineers. Conclusions While strides have been made both in identification and solution of issues of particular import to bioinformatics software development, there is still room for improvement in terms of shifts in both the formal education of bioinformatics engineers as well as the culture and approaches of managing scientific bioinformatics research and development efforts.

Inter-team communication in large-scale co-located software engineering: a case study

Large-scale software engineering is a collaborative effort where teams need to communicate to develop software products. Managers face the challenge of how to organise work to facilitate necessary communication between teams and individuals. This includes a range of decisions from distributing work over teams located in multiple buildings and sites, through work processes and tools for coordinating work, to softer issues including ensuring well-functioning teams. In this case study, we focus on inter-team communication by considering geographical, cognitive and psychological distances between teams, and factors and strategies that can affect this communication. Data was collected for ten test teams within a large development organisation, in two main phases: (1) measuring cognitive and psychological distance between teams using interactive posters, and (2) five focus group sessions where the obtained distance measurements were discussed. We present ten factors and five strategies, and how these relate to inter-team communication. We see three types of arenas that facilitate inter-team communication, namely physical, virtual and organisational arenas. Our findings can support managers in assessing and improving communication within large development organisations. In addition, the findings can provide insights into factors that may explain the challenges of scaling development organisations, in particular agile organisations that place a large emphasis on direct communication over written documentation.

Aligning Software Engineering and Artificial Intelligence With Transdisciplinary

The study examined AI and SE transdisciplinarity to find ways of aligning them and to enable the development of an AI-SE transdisciplinary theory. A literature review and analysis method was used. The findings are that AI and SE transdisciplinarity is tacit, with islands within and between them that can be linked to accelerate their transdisciplinary orientation through codification and by internally developing, and externally borrowing and adapting, transdisciplinary theories. Lack of theory has been identified as the major barrier towards maturing the two disciplines as engineering disciplines. Creating AI and SE transdisciplinary theory would contribute to maturing AI and SE as engineering disciplines. The implications of the study are that transdisciplinary theory can support mode 2 and 3 AI and SE innovations and provide an alternative for maturing the two disciplines as engineering disciplines. The study’s originality is that it is the first of its kind in SE, AI, or their intersections.


Invenia Blog

Blogging About Electricity Grids, Julia, and Machine Learning

The Hitchhiker’s Guide to Research Software Engineering: From PhD to RSE

Author: Glenn Moynihan

In 2017, the twilight days of my PhD in computational physics, I found myself ready to leave academia behind. While my research was interesting, it was not what I wanted to pursue full time. However, I was happy with the type of work I was doing, contributing to research software, and I wanted to apply myself in a more industrial setting.

Many postgraduates face a similar decision. A study conducted by the Royal Society in 2010 reported that only 3.5% of PhD graduates end up in permanent research positions in academia. Leaving aside the roots of the brain drain on universities, it is a compelling statistic that the vast majority of postgraduates end up leaving academia for industry at some point in their career. It comes as no surprise that a growing number of bootcamps like S2DS, faculty.ai, and Insight have sprung up in response to this trend, for machine learning and data science especially. There is also no shortage of helpful forum discussions and blog posts outlining what you should do in order to “break into the industry”, as well as many that relate the personal experiences of those who ultimately made the switch.

While the advice that follows in this blog post is directed at those looking to change careers, it would equally benefit those who opt to remain in the academic track. Since the environment and incentives around building academic research software are very different to those of industry, the workflows around the former are, in general, not guided by the same engineering practices that are valued in the latter.

That is to say: there is a difference between what is important in writing software for research, and for a user-focused software product. Academic research software prioritises scientific correctness and flexibility to experiment above all else in pursuit of the researchers’ end product: published papers. Industry software, on the other hand, prioritises maintainability, robustness, and testing as the software (generally speaking) is the product.

However, the two tracks share many common goals as well, such as catering to “users” and emphasising performance and reproducibility, but most importantly, both ventures are collaborative. Arguably then, both sets of principles are needed to write and maintain high-quality research software. Incidentally, the Research Software Engineering group at Invenia is uniquely tasked with incorporating all these incentives into the development of our research packages in order to get the best of both worlds. But I digress.

What I wish I knew in my PhD

Most postgrads are self-taught programmers and learn from the same resources as their peers and collaborators, which are ostensibly adequate for academia. Many also tend to work in isolation on their part of the code base and don’t require merging with other contributors’ work very frequently. In industry, however, continuous integration underpins many development workflows. Under a continuous delivery cycle, a developer benefits from the prompt feedback and cooperation of a full team of professional engineers and can, therefore, learn to implement engineering best practices more efficiently.

As such, it feels like a missed opportunity for universities not to promote good engineering practices more and teach them to their students. Not least because stable and maintainable tools are, in a sense, “public goods” in academia as much as industry. Yet, while everyone gains from improving the tools, researchers are not generally incentivised to invest their precious time or effort on these tasks unless it is part of some well-funded, high-impact initiative. As Jake VanderPlas remarked: “any time spent building and documenting software tools is time spent not writing research papers, which are the primary currency of the academic reward structure”.

Speaking personally, I learned a great deal about conducting research and scientific computing in my PhD; I could read and write code, squash bugs, and I wasn’t afraid of getting my hands dirty in monolithic code bases. As such, I felt comfortable at the command line but I failed to learn the basic tenets of proper code maintenance, unit testing, code review, version control, etc., that underpin good software engineering. While I had enough coding experience to have a sense of this at the time, I lacked the awareness of what I needed to know in order to improve or even where to start looking.

As is clear from the earlier statistic, this experience is likely not unique to me. It prompted me to share what I’ve learned since joining Invenia 18 months ago, so that it might guide those looking to make a similar move. The advice I provide is organised into three sections: the first recommends ways to learn a new programming language efficiently[1]; the second describes some best practices you can adopt to improve the quality of the code you write; and the last commends the social aspect of community-driven software collaborations.

Lesson 1: Hone your craft

Practice : While clichéd, there is no avoiding the fact that it takes consistent practice over many many years to become masterful at anything, and programming is no exception.

Have personal projects: Practicing is easier said than done if your job doesn’t revolve around programming. A good way to get started either way is to undertake personal side-projects as a fun way to get to grips with a language, for instance via Project Euler, Kaggle Competitions, etc. These should be enough to get you off the ground and familiar with the syntax of the language.

Read code: Personal projects on their own are not enough to improve. If you really want to get better, you’ve got to read other people’s code: a lot of it. Check out the repositories of some of your favourite or most used packages, particularly if they are considered “high quality”[2]. See how the package is organised, how the documentation is written, and how the code is structured. Look at the open issues and pull requests. Who are the main contributors? Get a sense of what is being worked on and how the open-source community operates. This will give you an idea of the open issues facing the package and the language and the direction it is taking. It will also show you how to write idiomatic code, that is, in a way that is natural for that language.

Contribute: You should actually contribute to the code base you use. This is by far the most important advice for improving and I cannot overstate how instructive an experience this is. By getting your code reviewed you get prompt and informative feedback on what you’re doing wrong and how you can do better. It gives you the opportunity to try out what you’ve learned and to learn something new, and it improves your confidence in your ability. Contributing to open source and seeing your features being used is also rewarding, and that starts a positive feedback loop where you feel like contributing more. Further, when you start applying for jobs in industry, people can see your work and so know that you are good at what you do (I say this as a person who is now involved in reviewing these applications).

Study: Learning by experience is great but, at least for me, it takes a deliberate approach to formalise and cement new ideas. Read well-reviewed books on your language (appropriate for your level) and reinforce what you learn by tackling more complex tasks and venturing outside your comfort zone. Reading blog posts and articles about the language is also a great idea.

Ask for help: Sometimes a bug just stumps you, or you just don’t know how to implement a feature. In these circumstances, it’s quicker to reach out to experts who can help and maybe teach you something at the same time. More often than not, someone has had the same problem or they’re happy to point you in the right direction. I’m fortunate to work with Julia experts at Invenia, so when I have a problem they are always most helpful. But posting on public fora like Slack, Discourse, or StackOverflow is an option we all have.

Lesson 2: Software Engineering Practices

With respect to the environment and incentives in industry surrounding code maintainability, robustness, and testing, there are certain practices in place to encourage, enable, and ensure these qualities are met. These key practices can turn a collection of scripts into a fully implemented package one can use and rely upon with high confidence.

While there are without doubt many universities and courses that teach these practices to their students, I find they are often neglected by coding novices and academics alike, to their own disadvantage.

Take version control seriously: Git is a programming staple for version control, and while it is tempting to disregard it when working alone, without it you soon find yourself creating convoluted naming schemes for your files; frequently losing track of progress; and wasting time looking through email attachments for the older version of the code to replace the one you just messed up.

Git can be a little intimidating to get started with, but once you are comfortable with the basic commands (fetch, add, commit, push, pull, merge) and a few others (checkout, rebase, reset) you will never look back. GitHub’s utility, meanwhile, extends far beyond that of a programmatic hosting service; it provides documentation hosting, CI/CD pipelines, and many other features that enable efficient cross-party collaboration on an enterprise scale.

It cannot be overstated how truly indispensable Git and GitHub are when it comes to turning your code into functional packages, and the earlier you adopt these the better. It also helps to know how semantic versioning works, so you will know what it means to increment a package version from 1.2.3 to 1.3.0 and why.
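To make the versioning rule concrete, here is a minimal Python sketch of how a MAJOR.MINOR.PATCH version is bumped under semantic versioning. The `Version` class and its `parse` and `bump` methods are hypothetical names chosen for illustration, and pre-release or build suffixes are deliberately not handled.

```python
from typing import NamedTuple


class Version(NamedTuple):
    """A MAJOR.MINOR.PATCH version (illustrative; no pre-release tags)."""

    major: int
    minor: int
    patch: int

    @classmethod
    def parse(cls, text: str) -> "Version":
        # Expects exactly three dot-separated integers, e.g. "1.2.3".
        major, minor, patch = (int(part) for part in text.split("."))
        return cls(major, minor, patch)

    def bump(self, part: str) -> "Version":
        # A breaking change bumps MAJOR and resets MINOR and PATCH.
        if part == "major":
            return Version(self.major + 1, 0, 0)
        # A backwards-compatible feature bumps MINOR and resets PATCH.
        if part == "minor":
            return Version(self.major, self.minor + 1, 0)
        # A backwards-compatible bug fix bumps PATCH only.
        if part == "patch":
            return Version(self.major, self.minor, self.patch + 1)
        raise ValueError(f"unknown version part: {part!r}")

    def __str__(self) -> str:
        return f"{self.major}.{self.minor}.{self.patch}"


print(Version.parse("1.2.3").bump("minor"))  # 1.3.0
```

Because the class is a tuple underneath, versions compare component by component, so `Version.parse("1.10.0")` sorts after `Version.parse("1.9.5")`, which a plain string comparison would get wrong.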

Organise your code: In terms of packaging your code, get to know the typical package folder structure. Packages often contain src, docs, and test directories, as well as standard artefacts like a README, to explain what the package is about, and a list of dependencies, e.g. Project and Manifest files in Julia, or requirements.txt in Python. Implementing the familiar package structure keeps things organised and enables you and other users to navigate the contents more easily.
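As a rough illustration, the layout described above can be checked with a few lines of Python. This is only a sketch: the function name `missing_package_parts` is hypothetical, and the file name `README.md` is an assumption (the README could equally be plain text or reStructuredText).

```python
from pathlib import Path

# Directory names taken from the layout described above.
EXPECTED_DIRS = ("src", "docs", "test")
# Assumed file name; adjust to whatever README format the project uses.
EXPECTED_FILES = ("README.md",)


def missing_package_parts(root: Path) -> list[str]:
    """Return the expected directories and files that are absent under root."""
    missing = [f"{name}/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    missing += [name for name in EXPECTED_FILES if not (root / name).is_file()]
    return missing
```

Calling `missing_package_parts(Path("."))` from a project root returns an empty list when everything is in place, which makes it easy to drop into a pre-commit check or a CI step.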

Practice code hygiene: This relates to the readability and maintainability of the code itself. It’s important to practice good hygiene if you want your code to be used, extended, and maintained by others. Bad code hygiene will turn off other contributors (and eventually yourself), leaving the package unused and unmaintained. Here are some tips for ensuring good hygiene:

  • Take a design-first approach when creating your package. Think about the intended user(s) and what their requirements are—this may be others in your research group or your future self. Sometimes this can be difficult to know in advance but working iteratively is better than trying to capture all possible use cases at once.
  • Think about how the API should work and how it integrates with other packages or applications. Are you building on something that already exists or is your package creating something entirely new?
  • There should be a style guide for writing in the language, for example, BlueStyle in Julia and PEP 8 in Python. You should adhere to it so that your code follows the same standard as everyone else.
  • Give your variables and functions meaningful and memorable names. There is no advantage to obfuscating your code for the sake of brevity.
  • Furthermore, read up on the language’s Design Patterns. These are the common approaches or techniques used in the language, which you will recognise from reading the code. These will help you write better, more idiomatic code.

Write good documentation: The greatest package ever written would never be used if nobody knew how it worked. At the very least, your code should be commented and accompanied by a README that explains to your users (and your future self) what the package does and how to install and use it. You should also attach docstrings to all user-facing (aka public) functions to explain what they do, what inputs they take, what data types they return, etc. This also applies to some internal functions, to remind maintainers (including you) what they do and how they are used. Some minimum working examples of how to use the package features are also a welcome addition.

Lastly, documentation should evolve with the package; when the API changes or new use-cases get added these should be reflected in the latest documentation.
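For instance, a docstring on a public function might look like the Python sketch below. The function `moving_average` is a made-up example, and the Parameters/Returns/Raises layout is just one common convention; your language or project may prescribe a different one.

```python
def moving_average(values, window):
    """Return the simple moving average of ``values``.

    Parameters
    ----------
    values : list of float
        The observations, in order.
    window : int
        Number of consecutive observations averaged at each step.
        Must be at least 1.

    Returns
    -------
    list of float
        One average per full window, so the result has
        ``len(values) - window + 1`` entries (empty if the
        window is longer than the input).

    Raises
    ------
    ValueError
        If ``window`` is smaller than 1.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    return [
        sum(values[start:start + window]) / window
        for start in range(len(values) - window + 1)
    ]
```

A reader can now tell what goes in, what comes out, and what fails without opening the function body, and `help(moving_average)` shows the same text at the prompt.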

Write good tests: Researchers in computational fields might find familiar the practice of running “canonical experiments” or “reproducibility tests” that check if the code produces the correct result for some pipeline and is therefore “calibrated”. But these don’t necessarily provide good or meaningful test coverage. For instance, canonical experiments, by definition, test the software within the limits of its intended use. This will not reveal latent bugs that only manifest under certain conditions, e.g. when encountering corner cases.

To capture these you need to write adequate Unit and Integration Tests that cover all expected corner cases to be reasonably sure your code is doing what it should. Even then you can’t guarantee there isn’t a corner case you haven’t considered, but testing certainly helps.
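As a small sketch of what corner-case unit tests can look like, consider a hypothetical `median` function in Python. The tests are plain functions whose names start with `test_`, which is the pattern a runner such as pytest collects; the function and the chosen cases are illustrative only.

```python
def median(values):
    """Return the median of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("median() requires at least one value")
    ordered = sorted(values)
    mid = len(ordered) // 2
    if len(ordered) % 2 == 1:
        return ordered[mid]
    # Even length: average the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


def test_single_value():
    assert median([7]) == 7


def test_even_length_averages_middle_pair():
    assert median([4, 1, 3, 2]) == 2.5


def test_input_order_does_not_matter():
    assert median([3, 1, 2]) == 2


def test_empty_input_is_rejected():
    try:
        median([])
    except ValueError:
        return
    raise AssertionError("expected ValueError for empty input")
```

A canonical experiment would probably only exercise the odd-length, already-sorted case; the single-value, even-length, and empty inputs are where such a function tends to break.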

If you do catch a bug it’s not enough to fix it and call it a day; you need to write a new test that replicates it, and you will have fixed the bug only when that new test passes. This new test prevents regressions in behaviour if the bug ever returns.
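The workflow can be sketched in Python as follows. The function `normalise` and the bug it once had are hypothetical: assume a user reported that an all-zero input raised `ZeroDivisionError`, and that the agreed fix was to return the zeros unchanged.

```python
def normalise(weights):
    """Scale non-negative weights so that they sum to 1.

    Hypothetical history: an all-zero input used to raise
    ZeroDivisionError. The fix returns the input values unchanged.
    """
    total = sum(weights)
    if total == 0:
        # The guard added by the bug fix.
        return list(weights)
    return [weight / total for weight in weights]


def test_normalise_typical_input():
    assert normalise([1, 1, 2]) == [0.25, 0.25, 0.5]


def test_normalise_all_zero_input_regression():
    # Written to reproduce the reported bug; it failed before the fix
    # and must keep passing from now on.
    assert normalise([0, 0]) == [0, 0]
```

Writing the regression test first, watching it fail, and only then applying the fix confirms that the test really captures the bug rather than something adjacent to it.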

Lesson 3: Take Part in the Community

Undertaking a fraction of the points above would be more than enough to boost your ability to develop software. But the return on investment is compounded by taking part in the community forums on Slack and Discourse; joining organizations on GitHub; and attending Meetups and conferences. Taking part in a collaboration (and meeting your co-developers) fosters a strong sense of community that supports continual learning and encouragement to go and do great things. In smaller communities related to a particular tool or niche language, you may even become well-known such that your potential future employer (or some of their engineers) is already familiar with who you are before you apply.

Personal experience has taught me that the incentives in academic research can be qualitatively different from those in industry, despite the overlap they share. However, the practices that are instilled in one track don’t necessarily translate off-the-shelf to the other, and switching gears between these (often competing) frameworks can initially induce an all-too-familiar sense of imposter syndrome.

It’s important to remember that what you learn and internalise in a PhD is, in a sense, “selected for” according to the incentives of that environment, as outlined above. However, under the auspices of a supportive community and the proper guidelines, it’s possible to become more well-rounded in your skillset, as I have. And while I still have much more to learn, it’s encouraging to reflect on what I have learned during my time at Invenia and share it with others.

Although this post could not possibly relay everything there is to know about software engineering, my hope is that simply being exposed to the lexicon will serve as a springboard to further learning. To those looking down such a path, I say: you will make many many mistakes, as one always does at the outset of a new venture, but that’s all part of learning.

[1] While these tips are language-agnostic, they would be particularly helpful for anyone interested in learning or improving with Julia.

[2] Examples of high-quality packages include Requests in Python and NamedDims.jl in Julia.


Grad Coach

Research Topics & Ideas: CompSci & IT

50+ Computer Science Research Topic Ideas To Fast-Track Your Project

IT & Computer Science Research Topics

Finding and choosing a strong research topic is the critical first step when it comes to crafting a high-quality dissertation, thesis or research project. If you’ve landed on this post, chances are you’re looking for a computer science-related research topic, but aren’t sure where to start. Here, we’ll explore a variety of CompSci & IT-related research ideas and topic thought-starters, including algorithms, AI, networking, database systems, UX, information security and software engineering.

NB – This is just the start…

The topic ideation and evaluation process has multiple steps. In this post, we’ll kickstart the process by sharing some research topic ideas within the CompSci domain. This is the starting point, but to develop a well-defined research topic, you’ll need to identify a clear and convincing research gap, along with a well-justified plan of action to fill that gap.

If you’re new to the oftentimes perplexing world of research, or if this is your first time undertaking a formal academic research project, be sure to check out our free dissertation mini-course. In it, we cover the process of writing a dissertation or thesis from start to end. Be sure to also sign up for our free webinar that explores how to find a high-quality research topic. 

Overview: CompSci Research Topics

  • Algorithms & data structures
  • Artificial intelligence (AI)
  • Computer networking
  • Database systems
  • Human-computer interaction
  • Information security (IS)
  • Software engineering
  • Examples of CompSci dissertations & theses

Topics & Ideas: Algorithms & Data Structures

  • An analysis of neural network algorithms’ accuracy for processing consumer purchase patterns
  • A systematic review of the impact of graph algorithms on data analysis and discovery in social media network analysis
  • An evaluation of machine learning algorithms used for recommender systems in streaming services
  • A review of approximation algorithm approaches for solving NP-hard problems
  • An analysis of parallel algorithms for high-performance computing of genomic data
  • The influence of data structures on optimal algorithm design and performance in Fintech
  • A survey of algorithms applied in internet of things (IoT) systems in supply-chain management
  • A comparison of streaming algorithm performance for the detection of elephant flows
  • A systematic review and evaluation of machine learning algorithms used in facial pattern recognition
  • Exploring the performance of a decision tree-based approach for optimizing stock purchase decisions
  • Assessing the importance of complete and representative training datasets in agricultural machine learning-based decision making
  • A comparison of deep learning algorithm performance for structured and unstructured datasets with “rare cases”
  • A systematic review of noise reduction best practices for machine learning algorithms in geoinformatics
  • Exploring the feasibility of applying information theory to feature extraction in retail datasets
  • Assessing the use case of neural network algorithms for image analysis in biodiversity assessment

Topics & Ideas: Artificial Intelligence (AI)

  • Applying deep learning algorithms for speech recognition in speech-impaired children
  • A review of the impact of artificial intelligence on decision-making processes in stock valuation
  • An evaluation of reinforcement learning algorithms used in the production of video games
  • An exploration of key developments in natural language processing and how they impacted the evolution of chatbots
  • An analysis of the ethical and social implications of artificial intelligence-based automated marking
  • The influence of large-scale GIS datasets on artificial intelligence and machine learning developments
  • An examination of the use of artificial intelligence in orthopaedic surgery
  • The impact of explainable artificial intelligence (XAI) on transparency and trust in supply chain management
  • An evaluation of the role of artificial intelligence in financial forecasting and risk management in cryptocurrency
  • A meta-analysis of deep learning algorithm performance in predicting cyber attacks in schools


Topics & Ideas: Networking

  • An analysis of the impact of 5G technology on internet penetration in rural Tanzania
  • Assessing the role of software-defined networking (SDN) in modern cloud-based computing
  • A critical analysis of network security and privacy concerns associated with Industry 4.0 investment in healthcare
  • Exploring the influence of cloud computing on security risks in fintech
  • An examination of the use of network function virtualization (NFV) in telecom networks in Southern America
  • Assessing the impact of edge computing on network architecture and design in IoT-based manufacturing
  • An evaluation of the challenges and opportunities in 6G wireless network adoption
  • The role of network congestion control algorithms in improving network performance on streaming platforms
  • An analysis of network coding-based approaches for data security
  • Assessing the impact of network topology on network performance and reliability in IoT-based workspaces


Topics & Ideas: Database Systems

  • An analysis of big data management systems and technologies used in B2B marketing
  • The impact of NoSQL databases on data management and analysis in smart cities
  • An evaluation of the security and privacy concerns of cloud-based databases in financial organisations
  • Exploring the role of data warehousing and business intelligence in global consultancies
  • An analysis of the use of graph databases for data modelling and analysis in recommendation systems
  • The influence of the Internet of Things (IoT) on database design and management in the retail grocery industry
  • An examination of the challenges and opportunities of distributed databases in supply chain management
  • Assessing the impact of data compression algorithms on database performance and scalability in cloud computing
  • An evaluation of the use of in-memory databases for real-time data processing in patient monitoring
  • Comparing the effects of database tuning and optimization approaches in improving database performance and efficiency in omnichannel retailing

Topics & Ideas: Human-Computer Interaction

  • An analysis of the impact of mobile technology on human-computer interaction prevalence in adolescent men
  • An exploration of how artificial intelligence is changing human-computer interaction patterns in children
  • An evaluation of the usability and accessibility of web-based systems for CRM in the fast fashion retail sector
  • Assessing the influence of virtual and augmented reality on consumer purchasing patterns
  • An examination of the use of gesture-based interfaces in architecture
  • Exploring the impact of ease of use in wearable technology on geriatric users
  • Evaluating the ramifications of gamification in the Metaverse
  • A systematic review of user experience (UX) design advances associated with Augmented Reality
  • Comparing end-user perceptions of natural language processing algorithms for automated customer response
  • Analysing the impact of voice-based interfaces on purchase practices in the fast food industry


Topics & Ideas: Information Security

  • A bibliometric review of current trends in cryptography for secure communication
  • An analysis of secure multi-party computation protocols and their applications in cloud-based computing
  • An investigation of the security of blockchain technology in patient health record tracking
  • A comparative study of symmetric and asymmetric encryption algorithms for instant text messaging
  • A systematic review of secure data storage solutions used for cloud computing in the fintech industry
  • An analysis of intrusion detection and prevention systems used in the healthcare sector
  • Assessing security best practices for IoT devices in political offices
  • An investigation into the role social media played in shifting regulations related to privacy and the protection of personal data
  • A comparative study of digital signature schemes adoption in property transfers
  • An assessment of the security of secure wireless communication systems used in tertiary institutions

Topics & Ideas: Software Engineering

  • A study of agile software development methodologies and their impact on project success in pharmacology
  • Investigating the impacts of software refactoring techniques and tools in blockchain-based developments
  • A study of the impact of DevOps practices on software development and delivery in the healthcare sector
  • An analysis of software architecture patterns and their impact on the maintainability and scalability of cloud-based offerings
  • A study of the impact of artificial intelligence and machine learning on software engineering practices in the education sector
  • An investigation of software testing techniques and methodologies for subscription-based offerings
  • A review of software security practices and techniques for protecting against phishing attacks from social media
  • An analysis of the impact of cloud computing on the rate of software development and deployment in the manufacturing sector
  • Exploring the impact of software development outsourcing on project success in multinational contexts
  • An investigation into the effect of poor software documentation on app success in the retail sector

CompSci & IT Dissertations/Theses

While the ideas we’ve presented above are a decent starting point for finding a CompSci-related research topic, they are fairly generic and non-specific. So, it helps to look at actual dissertations and theses to see how this all comes together.

Below, we’ve included a selection of research projects from various CompSci-related degree programs to help refine your thinking. These are actual dissertations and theses, written as part of Master’s and PhD-level programs, so they can provide some useful insight as to what a research topic looks like in practice.

  • An array-based optimization framework for query processing and data analytics (Chen, 2021)
  • Dynamic Object Partitioning and replication for cooperative cache (Asad, 2021)
  • Embedding constructural documentation in unit tests (Nassif, 2019)
  • PLASA | Programming Language for Synchronous Agents (Kilaru, 2019)
  • Healthcare Data Authentication using Deep Neural Network (Sekar, 2020)
  • Virtual Reality System for Planetary Surface Visualization and Analysis (Quach, 2019)
  • Artificial neural networks to predict share prices on the Johannesburg stock exchange (Pyon, 2021)
  • Predicting household poverty with machine learning methods: the case of Malawi (Chinyama, 2022)
  • Investigating user experience and bias mitigation of the multi-modal retrieval of historical data (Singh, 2021)
  • Detection of HTTPS malware traffic without decryption (Nyathi, 2022)
  • Redefining privacy: case study of smart health applications (Al-Zyoud, 2019)
  • A state-based approach to context modeling and computing (Yue, 2019)
  • A Novel Cooperative Intrusion Detection System for Mobile Ad Hoc Networks (Solomon, 2019)
  • HRSB-Tree for Spatio-Temporal Aggregates over Moving Regions (Paduri, 2019)

Looking at these titles, you can probably pick up that the research topics here are quite specific and narrowly focused, compared to the generic ones presented earlier. This is an important thing to keep in mind as you develop your own research topic. That is to say, to create a top-notch research topic, you must be precise and target a specific context with specific variables of interest. In other words, you need to identify a clear, well-justified research gap.




  • Frontiers in Computer Science

Software Engineering and Intelligent Systems


About this Research Topic

Software engineering and intelligent systems are two dynamic and interrelated fields that have witnessed significant advancements and transformations in recent years. The convergence of these domains has led to the development of innovative applications and solutions that are shaping various industries, from healthcare and finance to transportation and manufacturing. This Research Topic aims to serve as a platform for researchers and practitioners to share their insights, findings, and innovations in this exciting and rapidly evolving intersection of software engineering and intelligent systems. Intelligent systems, often powered by artificial intelligence (AI) and machine learning (ML), have seen remarkable growth and adoption in various domains.

We invite researchers and practitioners from academia and industry to contribute their original work to this article collection. Topics of interest include but are not limited to the following:

  • AI in Software Development
  • Adaptive and Self-Learning Software
  • Data Science and Big Data in Software Engineering
  • Software Engineering Methodologies
  • Artificial Intelligence and Machine Learning
  • Case Studies and Industry Applications

Extended versions of work presented at the “Austrian Conference on Research at Universities of Applied Sciences (FHK Konferenz 2024)” are particularly welcome to submit to this journal. Extended versions should have at least 30% novel content.

Keywords: Software Engineering, Artificial Intelligence, Intelligent Systems, Data Science, Machine Learning

Important Note: All contributions to this Research Topic must be within the scope of the section and journal to which they are submitted, as defined in their mission statements. Frontiers reserves the right to guide an out-of-scope manuscript to a more suitable section or journal at any stage of peer review.



Wonderful Engineering

Software Engineer Research Paper Topics 2021: Top 5


Whether you’re studying in advance or you’re close to getting that Software Engineering degree, it’s crucial that you look for possible research paper topics in advance. This will help you have an advantage in your course.

First off, remember that software engineering revolves around tech development and improvement.

Hence, your research paper should have the same goal. It shouldn’t be too complex so that you can go through it smoothly. At the same time, it shouldn’t be too easy to the point that it can be looked up online.

Choosing can be a difficult task, and a poorly chosen topic makes the whole paper harder to write. Thus, to help you land on the best topic for your needs, we have listed the top 5 software engineer research paper topics in the next sections.

Machine Learning

Machine learning is one of the research topics software engineers choose most often. If you’re not yet familiar with it, it’s a field that revolves around producing programs that improve their own behavior using existing data and experience, rather than relying only on hand-written rules.

Basically, machine learning aims to make intelligent tools. Here, you will need to apply various statistical methods in your algorithms, which makes it a complex and lengthy topic.

Even so, the good thing about this field is that it covers a lot of subtopics. These can include using machine learning for face spoof detection, iris detection, sentiment analysis, and the like. Usually, though, machine learning goes hand in hand with certain detection systems.
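To make the idea of a program that improves from data concrete, here is a minimal sketch of a sentiment classifier. It uses a simple perceptron over individual words, which is only one of many possible techniques and is chosen here for brevity. The function names and the toy review data are assumptions made up for illustration.

```python
from collections import defaultdict

def train_perceptron(examples, epochs=10):
    """Learn per-word weights from (text, label) pairs, with label +1 or -1."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            words = text.lower().split()
            score = sum(weights[w] for w in words)
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Mistake-driven update: nudge each word toward the true label.
                for w in words:
                    weights[w] += label
    return weights

def predict(weights, text):
    """Classify text as +1 (positive) or -1 (negative); no bias term is used."""
    score = sum(weights.get(w, 0.0) for w in text.lower().split())
    return 1 if score > 0 else -1

# Hypothetical toy data, invented for this sketch.
TOY_REVIEWS = [
    ("great product works well", 1),
    ("terrible quality broke fast", -1),
    ("works great love it", 1),
    ("broke after one day terrible", -1),
]

model = train_perceptron(TOY_REVIEWS)
print(predict(model, "great love"))
print(predict(model, "terrible broke"))
```

The weights start at zero and change only when the program misclassifies a training example, so its behavior is shaped by the data rather than by hand-written rules. A real study would need far more data, a held-out test set, and a stronger model.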

Artificial Intelligence

Artificial Intelligence is a broader concept than machine learning. Note, though, that the latter is a subfield of AI.

AI refers to the human-like intelligence integrated into machines and computer programs. Focusing on this will give you many more topics to write about. Since it’s present in a lot of fields like gaming, marketing, and even routine automated tasks, you will have more materials to refer to.

Some things that you can write about in your paper include AI’s relationship with software engineering, robotics, and natural language processing. You can also write about the different types of artificial intelligence tools for a more guided research paper.

Internet Of Things

Another topic that you can write about is the Internet of Things, more commonly known as IoT. This refers to interconnected devices, machines, or even living beings, as long as a network connects them.

Writing about IoT will open a huge array of possibilities to write about. You can talk about whether the topic is a problem that needs additional solutions or improvements. At the same time, you will be able to talk about specific machine requirements since IoT works mainly with communication servers.

In addition, the concept of the Internet of Things is also used in several fields like agriculture, e-commerce, and medicine. Because of this, you can rest assured that you won’t run out of things to talk about or refer to.

Software Development Models

Next up, we have software development models. If you want to write a research paper about how one can start building an app or software, then using software development models as a topic is a good choice.

Here, you can choose to write about what the concept is or delve deeper into its different types. You can look into the Waterfall Model, V-Model, Incremental, RAD, Agile, Iterative, Spiral, and Prototype. You can choose either one or all of the models and then relate them to software engineering.

Clone Management

One of the most important elements in software engineering is the code base. Hence, using this as a research topic will help you stay relevant to your course and its needs. In particular, you can focus on clone management.

Clone management is a task that revolves around keeping a code base free from errors and duplicated code. What makes this a good topic is that published material on it is still limited compared to other clone-related topics. Hence, you can ensure a distinct topic for your paper.
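As a rough illustration of what detecting duplicated code can look like, the sketch below finds exact clones: runs of consecutive non-blank lines that repeat after whitespace is stripped. This is only the simplest case, and research tools also handle renamed identifiers and reordered statements. The function name, the window size, and the sample snippet are assumptions made up for this example.

```python
from collections import defaultdict

def find_clones(source, window=3):
    """Return groups of 1-based start lines whose next `window` non-blank lines match."""
    # Normalize: strip whitespace, drop blank lines, keep original line numbers.
    lines = [
        (number, line.strip())
        for number, line in enumerate(source.splitlines(), start=1)
        if line.strip()
    ]
    seen = defaultdict(list)
    for k in range(len(lines) - window + 1):
        chunk = tuple(text for _, text in lines[k:k + window])
        seen[chunk].append(lines[k][0])
    # A chunk seen at more than one start line is a candidate clone.
    return [starts for starts in seen.values() if len(starts) > 1]

# Hypothetical snippet containing one duplicated three-line block.
SAMPLE = """
total = 0
for x in items:
    total += x
print(total)

total = 0
for x in items:
    total += x
save(total)
"""

clones = find_clones(SAMPLE)
print(clones)  # [[2, 7]]
```

Each reported group lists the line numbers where the same block begins, which is the raw input a clone management process would then review, refactor, or track.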

To land on the best topic, take your interest into account. Look for the field that makes you curious and entertained. In this way, you can build motivation to actually know more about it, and not just for the sake of submitting.

Another good tip is to choose a unique topic. The ones we discussed above can be considered unique since they are some of the latest software-related topics. If you’re going to use a common one, then make sure that you put your own twist on it. You can also consider seeing the topic in a different light.

Anyhow, your research paper, its grade, and overall quality will greatly depend on what you choose to write about.


Princeton University


Suggested Undergraduate Research Topics


How to Contact Faculty for IW/Thesis Advising

Send the professor an e-mail. When you write a professor, be clear that you want a meeting regarding a senior thesis or one-on-one IW project, and briefly describe the topic or idea that you want to work on. Check the faculty listing for email addresses.

Parastoo Abtahi, Room 419

Available for single-semester IW and senior thesis advising, 2024-2025

  • Research Areas: Human-Computer Interaction (HCI), Augmented Reality (AR), and Spatial Computing
  • Input techniques for on-the-go interaction (e.g., eye-gaze, microgestures, voice) with a focus on uncertainty, disambiguation, and privacy.
  • Minimal and timely multisensory output (e.g., spatial audio, haptics) that enables users to attend to their physical environment and the people around them, instead of a 2D screen.
  • Interaction with intelligent systems (e.g., IoT, robots) situated in physical spaces with a focus on updating users’ mental model despite the complexity and dynamicity of these systems.

Ryan Adams, Room 411

Research areas:

  • Machine learning driven design
  • Generative models for structured discrete objects
  • Approximate inference in probabilistic models
  • Accelerating solutions to partial differential equations
  • Innovative uses of automatic differentiation
  • Modeling and optimizing 3d printing and CNC machining

Andrew Appel, Room 209

Available for Fall 2024 IW advising, only

  • Research Areas: Formal methods, programming languages, compilers, computer security.
  • Software verification (for which taking COS 326 / COS 510 is helpful preparation)
  • Game theory of poker or other games (for which COS 217 / 226 are helpful)
  • Computer game-playing programs (for which COS 217 / 226 are helpful)
  • Risk-limiting audits of elections (for which ORF 245 or other knowledge of probability is useful)

Sanjeev Arora, Room 407

  • Theoretical machine learning, deep learning and its analysis, natural language processing. My advisees would typically have taken a course in algorithms (COS423 or COS 521 or equivalent) and a course in machine learning.
  • Show that finding approximate solutions to NP-complete problems is also NP-complete (i.e., come up with NP-completeness reductions a la COS 487). 
  • Experimental Algorithms: Implementing and Evaluating Algorithms using existing software packages. 
  • Studying/designing provable algorithms for machine learning and implementations using packages like scipy and MATLAB, including applications in natural language processing and deep learning.
  • Any topic in theoretical computer science.

David August, Room 221

Not available for IW or thesis advising, 2024-2025

  • Research Areas: Computer Architecture, Compilers, Parallelism
  • Containment-based approaches to security:  We have designed and tested a simple hardware+software containment mechanism that stops incorrect communication resulting from faults, bugs, or exploits from leaving the system.   Let's explore ways to use containment to solve real problems.  Expect to work with corporate security and technology decision-makers.
  • Parallelism: Studies show much more parallelism than is currently realized in compilers and architectures.  Let's find ways to realize this parallelism.
  • Any other interesting topic in computer architecture or compilers. 

Mark Braverman, 194 Nassau St., Room 231

  • Research Areas: computational complexity, algorithms, applied probability, computability over the real numbers, game theory and mechanism design, information theory.
  • Topics in computational and communication complexity.
  • Applications of information theory in complexity theory.
  • Algorithms for problems under real-life assumptions.
  • Game theory, network effects
  • Mechanism design (could be on a problem proposed by the student)

Sebastian Caldas, 221 Nassau Street, Room 105

  • Research Areas: collaborative learning, machine learning for healthcare. Typically, I will work with students that have taken COS324.
  • Methods for collaborative and continual learning.
  • Machine learning for healthcare applications.

Bernard Chazelle, 194 Nassau St., Room 301

  • Research Areas: Natural Algorithms, Computational Geometry, Sublinear Algorithms. 
  • Natural algorithms (flocking, swarming, social networks, etc).
  • Sublinear algorithms
  • Self-improving algorithms
  • Markov data structures

Danqi Chen, Room 412

  • My advisees would be expected to have taken a course in machine learning and ideally have taken COS484 or an NLP graduate seminar.
  • Representation learning for text and knowledge bases
  • Pre-training and transfer learning
  • Question answering and reading comprehension
  • Information extraction
  • Text summarization
  • Any other interesting topics related to natural language understanding/generation

Marcel Dall'Agnol, Corwin 034

  • Research Areas: Theoretical computer science. (Specifically, quantum computation, sublinear algorithms, complexity theory, interactive proofs and cryptography)
  • Research Areas: Machine learning

Jia Deng, Room 423

  • Research Areas: Computer Vision, Machine Learning.
  • Object recognition and action recognition
  • Deep Learning, autoML, meta-learning
  • Geometric reasoning, logical reasoning

Adji Bousso Dieng, Room 406

  • Research areas: Vertaix is a research lab at Princeton University led by Professor Adji Bousso Dieng. We work at the intersection of artificial intelligence (AI) and the natural sciences. The models and algorithms we develop are motivated by problems in those domains and contribute to advancing methodological research in AI. We leverage tools in statistical machine learning and deep learning in developing methods for learning with the data, of various modalities, arising from the natural sciences.

Robert Dondero, Corwin Hall, Room 038

  • Research Areas:  Software engineering; software engineering education.
  • Develop or evaluate tools to facilitate student learning in undergraduate computer science courses at Princeton, and beyond.
  • In particular, can code critiquing tools help students learn about software quality?

Zeev Dvir, 194 Nassau St., Room 250

  • Research Areas: computational complexity, pseudo-randomness, coding theory and discrete mathematics.
  • Independent Research: I have various research problems related to Pseudorandomness, Coding theory, Complexity and Discrete mathematics - all of which require strong mathematical background. A project could also be based on writing a survey paper describing results from a few theory papers revolving around some particular subject.

Benjamin Eysenbach, Room 416

  • Research areas: reinforcement learning, machine learning. My advisees would typically have taken COS324.
  • Applying RL algorithms to problems in science and engineering.
  • Emergent behavior of RL algorithms on high-fidelity robotic simulators.
  • Studying how architectures and representations can facilitate generalization.

Christiane Fellbaum, 1-S-14 Green

  • Research Areas: theoretical and computational linguistics, word sense disambiguation, lexical resource construction, English and multilingual WordNet(s), ontology
  • Anything having to do with natural language--come and see me with/for ideas suitable to your background and interests. Some topics students have worked on in the past:
  • Developing parsers, part-of-speech taggers, morphological analyzers for underrepresented languages (you don't have to know the language to develop such tools!)
  • Quantitative approaches to theoretical linguistics questions
  • Extensions and interfaces for WordNet (English and WN in other languages),
  • Applications of WordNet(s), including:
  • Foreign language tutoring systems,
  • Spelling correction software,
  • Word-finding/suggestion software for ordinary users and people with memory problems,
  • Machine Translation 
  • Sentiment and Opinion detection
  • Automatic reasoning and inferencing
  • Collaboration with professors in the social sciences and humanities ("Digital Humanities")

Adam Finkelstein, Room 424 

  • Research Areas: computer graphics, audio.

Robert S. Fish, Corwin Hall, Room 037

  • Networking and telecommunications
  • Learning, perception, and intelligence, artificial and otherwise;
  • Human-computer interaction and computer-supported cooperative work
  • Online education, especially in Computer Science Education
  • Topics in research and development innovation methodologies including standards, open-source, and entrepreneurship
  • Distributed autonomous organizations and related blockchain technologies

Michael Freedman, Room 308 

  • Research Areas: Distributed systems, security, networking
  • Projects related to streaming data analysis, datacenter systems and networks, untrusted cloud storage and applications. Please see my group website at http://sns.cs.princeton.edu/ for current research projects.

Ruth Fong, Room 032

  • Research Areas: computer vision, machine learning, deep learning, interpretability, explainable AI, fairness and bias in AI
  • Develop a technique for understanding AI models
  • Design an AI model that is interpretable by design
  • Build a paradigm for detecting and/or correcting failure points in an AI model
  • Analyze an existing AI model and/or dataset to better understand its failure points
  • Build a computer vision system for another domain (e.g., medical imaging, satellite data, etc.)
  • Develop a software package for explainable AI
  • Adapt explainable AI research to a consumer-facing problem

Note: I am happy to advise any project if there's a sufficient overlap in interest and/or expertise; please reach out via email to chat about project ideas.

Tom Griffiths, Room 405

Available for Fall 2024 single-semester IW advising, only

Research areas: computational cognitive science, computational social science, machine learning and artificial intelligence

Note: I am open to projects that apply ideas from computer science to understanding aspects of human cognition in a wide range of areas, from decision-making to cultural evolution and everything in between. For example, we have current projects analyzing chess game data and magic tricks, both of which give us clues about how human minds work. Students who have expertise or access to data related to games, magic, strategic sports like fencing, or other quantifiable domains of human behavior feel free to get in touch.

Aarti Gupta, Room 220

  • Research Areas: Formal methods, program analysis, logic decision procedures
  • Finding bugs in open source software using automatic verification tools
  • Software verification (program analysis, model checking, test generation)
  • Decision procedures for logical reasoning (SAT solvers, SMT solvers)

Elad Hazan, Room 409  

  • Research interests: machine learning methods and algorithms, efficient methods for mathematical optimization, regret minimization in games, reinforcement learning, control theory and practice
  • Machine learning, efficient methods for mathematical optimization, statistical and computational learning theory, regret minimization in games.
  • Implementation and algorithm engineering for control, reinforcement learning and robotics
  • Implementation and algorithm engineering for time series prediction

Felix Heide, Room 410

  • Research Areas: Computational Imaging, Computer Vision, Machine Learning (focus on Optimization and Approximate Inference).
  • Optical Neural Networks
  • Hardware-in-the-loop Holography
  • Zero-shot and Simulation-only Learning
  • Object recognition in extreme conditions
  • 3D Scene Representations for View Generation and Inverse Problems
  • Long-range Imaging in Scattering Media
  • Hardware-in-the-loop Illumination and Sensor Optimization
  • Inverse Lidar Design
  • Phase Retrieval Algorithms
  • Proximal Algorithms for Learning and Inference
  • Domain-Specific Language for Optics Design

Peter Henderson, 302 Sherrerd Hall

  • Research Areas: Machine learning, law, and policy

Kyle Jamieson, Room 306

  • Research areas: Wireless and mobile networking; indoor radar and indoor localization; Internet of Things
  • See other topics on my independent work ideas page (campus IP and CS dept. login req'd)

Alan Kaplan, 221 Nassau Street, Room 105

Research Areas:

  • Random apps of kindness - mobile application/technology frameworks used to help individuals or communities; topic areas include, but are not limited to: first response, accessibility, environment, sustainability, social activism, civic computing, tele-health, remote learning, crowdsourcing, etc.
  • Tools automating programming language interoperability - Java/C++, React Native/Java, etc.
  • Software visualization tools for education
  • Connected consumer devices, applications and protocols

Brian Kernighan, Room 311

  • Research Areas: application-specific languages, document preparation, user interfaces, software tools, programming methodology
  • Application-oriented languages, scripting languages.
  • Tools; user interfaces
  • Digital humanities

Zachary Kincaid, Room 219

  • Research areas: programming languages, program analysis, program verification, automated reasoning
  • Independent Research Topics:
  • Develop a practical algorithm for an intractable problem (e.g., by developing practical search heuristics, or by reducing to, or by identifying a tractable sub-problem, ...).
  • Design a domain-specific programming language, or prototype a new feature for an existing language.
  • Any interesting project related to programming languages or logic.

Gillat Kol, Room 316

  • Research area: theory

Aleksandra Korolova, 309 Sherrerd Hall

  • Research areas: Societal impacts of algorithms and AI; privacy; fair and privacy-preserving machine learning; algorithm auditing.

Advisees typically have taken one or more of COS 226, COS 324, COS 423, COS 424 or COS 445.

Pravesh Kothari, Room 320

  • Research areas: Theory

Amit Levy, Room 307

  • Research Areas: Operating Systems, Distributed Systems, Embedded Systems, Internet of Things
  • Distributed hardware testing infrastructure
  • Second factor security tokens
  • Low-power wireless network protocol implementation
  • USB device driver implementation

Kai Li, Room 321

  • Research Areas: Distributed systems; storage systems; content-based search and data analysis of large datasets.
  • Fast communication mechanisms for heterogeneous clusters.
  • Approximate nearest-neighbor search for high dimensional data.
  • Data analysis and prediction of in-patient medical data.
  • Optimized implementation of classification algorithms on manycore processors.

Xiaoyan Li, 221 Nassau Street, Room 104

  • Research areas: Information retrieval, novelty detection, question answering, AI, machine learning and data analysis.
  • Explore new statistical retrieval models for document retrieval and question answering.
  • Apply AI in various fields.
  • Apply supervised or unsupervised learning in health, education, finance, and social networks, etc.
  • Any interesting project related to AI, machine learning, and data analysis.

Lydia Liu, Room 414

  • Research Areas: algorithmic decision making, machine learning and society
  • Theoretical foundations for algorithmic decision making (e.g. mathematical modeling of data-driven decision processes, societal level dynamics)
  • Societal impacts of algorithms and AI through a socio-technical lens (e.g. normative implications of worst case ML metrics, prediction and model arbitrariness)
  • Machine learning for social impact domains, especially education (e.g. responsible development and use of LLMs for education equity and access)
  • Evaluation of human-AI decision making using statistical methods (e.g. causal inference of long term impact)

Wyatt Lloyd, Room 323

  • Research areas: Distributed Systems
  • Caching algorithms and implementations
  • Storage systems
  • Distributed transaction algorithms and implementations

Alex Lombardi, Room 312

  • Research Areas: Theory

Margaret Martonosi, Room 208

  • Quantum Computing research, particularly related to architecture and compiler issues for QC.
  • Computer architectures specialized for modern workloads (e.g., graph analytics, machine learning algorithms, mobile applications)
  • Investigating security and privacy vulnerabilities in computer systems, particularly IoT devices.
  • Other topics in computer architecture or mobile / IoT systems also possible.

Jonathan Mayer, Sherrerd Hall, Room 307 

Available for Spring 2025 single-semester IW, only

  • Research areas: Technology law and policy, with emphasis on national security, criminal procedure, consumer privacy, network management, and online speech.
  • Assessing the effects of government policies, both in the public and private sectors.
  • Collecting new data that relates to government decision making, including surveying current business practices and studying user behavior.
  • Developing new tools to improve government processes and offer policy alternatives.

Mae Milano, Room 307

  • Local-first / peer-to-peer systems
  • Wide-area storage systems
  • Consistency and protocol design
  • Type-safe concurrency
  • Language design
  • Gradual typing
  • Domain-specific languages
  • Languages for distributed systems

Andrés Monroy-Hernández, Room 405

  • Research Areas: Human-Computer Interaction, Social Computing, Public-Interest Technology, Augmented Reality, Urban Computing
  • Research interests: developing public-interest socio-technical systems. We are currently creating alternatives to gig work platforms that are more equitable for all stakeholders. For instance, we are investigating the socio-technical affordances necessary to support a co-op food delivery network owned and managed by workers and restaurants. We are exploring novel system designs that support self-governance, decentralized/federated models, community-centered data ownership, and portable reputation systems. We have opportunities for students interested in human-centered computing, UI/UX design, full-stack software development, and qualitative/quantitative user research.
  • Beyond our core projects, we are open to working on research projects that explore the use of emerging technologies, such as AR, wearables, NFTs, and DAOs, for creative and out-of-the-box applications.

Christopher Moretti, Corwin Hall, Room 036

  • Research areas: Distributed systems, high-throughput computing, computer science/engineering education
  • Expansion, improvement, and evaluation of open-source distributed computing software.
  • Applications of distributed computing for "big science" (e.g. biometrics, data mining, bioinformatics)
  • Software and best practices for computer science education and study, especially Princeton's 126/217/226 sequence or MOOCs development
  • Sports analytics and/or crowd-sourced computing

Radhika Nagpal, F316 Engineering Quadrangle

  • Research areas: control, robotics and dynamical systems

Karthik Narasimhan, Room 422

  • Research areas: Natural Language Processing, Reinforcement Learning
  • Autonomous agents for text-based games ( https://www.microsoft.com/en-us/research/project/textworld/ )
  • Transfer learning/generalization in NLP
  • Techniques for generating natural language
  • Model-based reinforcement learning

Arvind Narayanan, 308 Sherrerd Hall 

Research Areas: fair machine learning (and AI ethics more broadly), the social impact of algorithmic systems, tech policy

Pedro Paredes, Corwin Hall, Room 041

My primary research work is in Theoretical Computer Science.

 * Research Interest: Spectral Graph theory, Pseudorandomness, Complexity theory, Coding Theory, Quantum Information Theory, Combinatorics.

The IW projects I am interested in advising can be divided into three categories:

 1. Theoretical research

I am open to advising research projects on any topic within my research areas of interest. A project could also consist of writing a survey of the results from a few papers. Students should have a solid background in math (e.g., elementary combinatorics, graph theory, discrete probability, basic algebra/calculus) and theoretical computer science (226 and 240 material, like big-O/Omega/Theta, basic complexity theory, basic fundamental algorithms). Mathematical maturity is a must.

A (non-exhaustive) list of project topics I'm interested in:

 * Explicit constructions of better vertex expanders and/or unique neighbor expanders.
 * Constructions of deterministic or random high-dimensional expanders.
 * Pseudorandom generators for different problems.
 * Topics around the quantum PCP conjecture.
 * Topics around quantum error correcting codes and locally testable codes, including constructions, encoding and decoding algorithms.

 2. Theory-informed practical implementations of algorithms

Very often the great advances in theoretical research are either not tested in practice or not even feasible to implement in practice. Thus, I am interested in any project that tries to make theoretical ideas applicable in practice. This includes coming up with new algorithms that trade some theoretical guarantees for feasible implementation while trying to retain the soul of the original idea; implementing new algorithms in a suitable programming language; and empirically testing practical implementations and comparing them with benchmarks / theoretical expectations. A project in this area doesn't have to be in my main areas of research; any theoretical result could be suitable for such a project.

Some examples of areas of interest:

 * Streaming algorithms.
 * Numeric linear algebra.
 * Property testing.
 * Parallel / Distributed algorithms.
 * Online algorithms.

 3. Machine learning with a theoretical foundation

I am interested in projects in machine learning that have some mathematical/theoretical component, even if most of the project is applied. This includes topics like mathematical optimization, statistical learning, fairness and privacy.

One area I have recently been interested in is rating systems (e.g., chess Elo) and their applications to experts problems.

Final Note: I am also willing to advise any project with any mathematical/theoretical component, even if it's not the main one; please reach out via email to chat about project ideas.

Iasonas Petras, Corwin Hall, Room 033

  • Research Areas: Information Based Complexity, Numerical Analysis, Quantum Computation.
  • Prerequisites: Reasonable mathematical maturity. In case of a project related to Quantum Computation a certain familiarity with quantum mechanics is required (related courses: ELE 396/PHY 208).
  • Possible research topics include:

1.   Quantum algorithms and circuits:

  • i. Design or simulation of quantum circuits implementing quantum algorithms.
  • ii. Design of quantum algorithms solving/approximating continuous problems (such as Eigenvalue problems for Partial Differential Equations).

2.   Information Based Complexity:

  • i. Necessary and sufficient conditions for tractability of Linear and Linear Tensor Product Problems in various settings (for example worst case or average case). 
  • ii. Necessary and sufficient conditions for tractability of Linear and Linear Tensor Product Problems under new tractability and error criteria.
  • iii. Necessary and sufficient conditions for tractability of Weighted problems.
  • iv. Necessary and sufficient conditions for tractability of Weighted Problems under new tractability and error criteria.

3. Topics in Scientific Computation:

  • i. Randomness, Pseudorandomness, MC and QMC methods and their applications (Finance, etc)

Yuri Pritykin, 245 Carl Icahn Lab

  • Research interests: Computational biology; Cancer immunology; Regulation of gene expression; Functional genomics; Single-cell technologies.
  • Potential research projects: Development, implementation, assessment and/or application of algorithms for analysis, integration, interpretation and visualization of multi-dimensional data in molecular biology, particularly single-cell and spatial genomics data.

Benjamin Raphael, Room 309  

  • Research interests: Computational biology and bioinformatics; Cancer genomics; Algorithms and machine learning approaches for analysis of large-scale datasets
  • Implementation and application of algorithms to infer evolutionary processes in cancer
  • Identifying correlations between combinations of genomic mutations in human and cancer genomes
  • Design and implementation of algorithms for genome sequencing from new DNA sequencing technologies
  • Graph clustering and network anomaly detection, particularly using diffusion processes and methods from spectral graph theory

Vikram Ramaswamy, 035 Corwin Hall

  • Research areas: Interpretability of AI systems, Fairness in AI systems, Computer vision.
  • Constructing a new method to explain a model / creating an interpretable-by-design model
  • Analyzing a current model / dataset to understand bias within the model/dataset
  • Proposing new fairness evaluations
  • Proposing new methods to train to improve fairness
  • Developing synthetic datasets for fairness / interpretability benchmarks
  • Understanding robustness of models

Ran Raz, Room 240

  • Research Area: Computational Complexity
  • Independent Research Topics: Computational Complexity, Information Theory, Quantum Computation, Theoretical Computer Science

Szymon Rusinkiewicz, Room 406

  • Research Areas: computer graphics; computer vision; 3D scanning; 3D printing; robotics; documentation and visualization of cultural heritage artifacts
  • Research ways of incorporating rotation invariance into computer vision tasks such as feature matching and classification
  • Investigate approaches to robust 3D scan matching
  • Model and compensate for imperfections in 3D printing
  • Given a collection of small mobile robots, apply control policies learned in simulation to the real robots.

Olga Russakovsky, Room 408

  • Research Areas: computer vision, machine learning, deep learning, crowdsourcing, fairness & bias in AI
  • Design a semantic segmentation deep learning model that can operate in a zero-shot setting (i.e., recognize and segment objects not seen during training)
  • Develop a deep learning classifier that is impervious to protected attributes (such as gender or race) that may be erroneously correlated with target classes
  • Build a computer vision system for the novel task of inferring what object (or part of an object) a human is referring to when pointing to a single pixel in the image. This includes collecting an appropriate dataset using crowdsourcing on Amazon Mechanical Turk, creating a new deep learning formulation for this task, and running extensive analysis of both the data and the model

Sebastian Seung, Princeton Neuroscience Institute, Room 153

  • Research Areas: computational neuroscience, connectomics, "deep learning" neural networks, social computing, crowdsourcing, citizen science
  • Gamification of neuroscience (EyeWire 2.0)
  • Semantic segmentation and object detection in brain images from microscopy
  • Computational analysis of brain structure and function
  • Neural network theories of brain function

Jaswinder Pal Singh, Room 324

  • Research Areas: Boundary of technology and business/applications; building and scaling technology companies with special focus at that boundary; parallel computing systems and applications: parallel and distributed applications and their implications for software and architectural design; system software and programming environments for multiprocessors.
  • Develop a startup company idea, and build a plan/prototype for it.
  • Explore tradeoffs at the boundary of technology/product and business/applications in a chosen area.
  • Study and develop methods to infer insights from data in different application areas, from science to search to finance to others. 
  • Design and implement a parallel application. Possible areas include graphics, compression, biology, among many others. Analyze performance bottlenecks using existing tools, and compare programming models/languages.
  • Design and implement a scalable distributed algorithm.

Mona Singh, Room 420

  • Research Areas: computational molecular biology, as well as its interface with machine learning and algorithms.
  • Whole and cross-genome methods for predicting protein function and protein-protein interactions.
  • Analysis and prediction of biological networks.
  • Computational methods for inferring specific aspects of protein structure from protein sequence data.
  • Any other interesting project in computational molecular biology.

Robert Tarjan, 194 Nassau St., Room 308

  • Research Areas: Data structures; graph algorithms; combinatorial optimization; computational complexity; computational geometry; parallel algorithms.
  • Implement one or more data structures or combinatorial algorithms to provide insight into their empirical behavior.
  • Design and/or analyze various data structures and combinatorial algorithms.

Olga Troyanskaya, Room 320

  • Research Areas: Bioinformatics; analysis of large-scale biological data sets (genomics, gene expression, proteomics, biological networks); algorithms for integration of data from multiple data sources; visualization of biological data; machine learning methods in bioinformatics.
  • Implement and evaluate one or more gene expression analysis algorithms.
  • Develop algorithms for assessment of performance of genomic analysis methods.
  • Develop, implement, and evaluate visualization tools for heterogeneous biological data.

David Walker, Room 211

  • Research Areas: Programming languages, type systems, compilers, domain-specific languages, software-defined networking and security
  • Independent Research Topics:  Any other interesting project that involves humanitarian hacking, functional programming, domain-specific programming languages, type systems, compilers, software-defined networking, fault tolerance, language-based security, theorem proving, logic or logical frameworks.

Shengyi Wang, Postdoctoral Research Associate, Room 216

Available for single-semester IW in Fall 2024 only

  • Independent Research topics: Explore Escher-style tilings using (introductory) group theory and automata theory to produce beautiful pictures.

Kevin Wayne, Corwin Hall, Room 040

  • Research Areas: design, analysis, and implementation of algorithms; data structures; combinatorial optimization; graphs and networks.
  • Design and implement computer visualizations of algorithms or data structures.
  • Develop pedagogical tools or programming assignments for the computer science curriculum at Princeton and beyond.
  • Develop assessment infrastructure and assessments for MOOCs.

Matt Weinberg, 194 Nassau St., Room 222

  • Research Areas: algorithms, algorithmic game theory, mechanism design, game theoretical problems in {Bitcoin, networking, healthcare}.
  • Theoretical questions related to COS 445 topics such as matching theory, voting theory, auction design, etc. 
  • Theoretical questions related to incentives in applications like Bitcoin, the Internet, health care, etc. In a little bit more detail: protocols for these systems are often designed assuming that users will follow them. But often, users will actually be strictly happier to deviate from the intended protocol. How should we reason about user behavior in these protocols? How should we design protocols in these settings?

Huacheng Yu, Room 310

  • data structures
  • streaming algorithms
  • design and analyze data structures / streaming algorithms
  • prove impossibility results (lower bounds)
  • implement and evaluate data structures / streaming algorithms

Ellen Zhong, Room 314

Opportunities outside the department.

We encourage students to look into doing interdisciplinary computer science research and to work with professors in departments other than computer science. However, every CS independent work project must have a strong computer science element (even if it has other scientific or artistic elements as well). To do a project with an adviser outside of computer science you must have permission of the department. This can be accomplished by having a second co-adviser within the computer science department or by contacting the independent work supervisor about the project and having them sign the independent work proposal form.

Here is a list of professors outside the computer science department who are eager to work with computer science undergraduates.

Maria Apostolaki, Engineering Quadrangle, C330

  • Research areas: Computing & Networking, Data & Information Science, Security & Privacy

Branko Glisic, Engineering Quadrangle, Room E330

  • Documentation of historic structures
  • Cyber physical systems for structural health monitoring
  • Developing virtual and augmented reality applications for documenting structures
  • Applying machine learning techniques to generate 3D models from 2D plans of buildings
  • Contact: Rebecca Napolitano, rkn2 (@princeton.edu)

Mihir Kshirsagar, Sherrerd Hall, Room 315

Center for Information Technology Policy.

  • Consumer protection
  • Content regulation
  • Competition law
  • Economic development
  • Surveillance and discrimination

Sharad Malik, Engineering Quadrangle, Room B224

  • Design of reliable hardware systems
  • Verifying complex software and hardware systems

Prateek Mittal, Engineering Quadrangle, Room B236

  • Internet security and privacy 
  • Social Networks
  • Privacy technologies, anonymous communication
  • Network Science
  • Internet security and privacy: The insecurity of Internet protocols and services threatens the safety of our critical network infrastructure and billions of end users. How can we defend end users as well as our critical network infrastructure from attacks?
  • Trustworthy social systems: Online social networks (OSNs) such as Facebook, Google+, and Twitter have revolutionized the way our society communicates. How can we leverage social connections between users to design the next generation of communication systems?
  • Privacy Technologies: Privacy on the Internet is eroding rapidly, with businesses and governments mining sensitive user information. How can we protect the privacy of our online communications? The Tor project (https://www.torproject.org/) is a potential application of interest.

Ken Norman, Psychology Dept, PNI 137

  • Research Areas: Memory, the brain and computation 
  • Lab: Princeton Computational Memory Lab

Potential research topics

  • Methods for decoding cognitive state information from neuroimaging data (fMRI and EEG) 
  • Neural network simulations of learning and memory

Caroline Savage

Office of Sustainability, Phone: (609) 258-7513, Email: cs35 (@princeton.edu)

The Campus as Lab program supports students using the Princeton campus as a living laboratory to solve sustainability challenges. The Office of Sustainability has created a list of campus as lab research questions, filterable by discipline and topic, on its website.

An example from Computer Science could include using TigerEnergy, a platform which provides real-time data on campus energy generation and consumption, to study one of the many energy systems or buildings on campus. Three CS students used TigerEnergy to create a live energy heatmap of campus.

Other potential projects include:

  • Apply game theory to sustainability challenges
  • Develop a tool to help visualize interactions between complex campus systems, e.g. energy and water use, transportation and storm water runoff, purchasing and waste, etc.
  • How can we learn (in aggregate) about individuals’ waste, energy, transportation, and other behaviors without impinging on privacy?

Janet Vertesi, Sociology Dept, Wallace Hall, Room 122

  • Research areas: Sociology of technology; Human-computer interaction; Ubiquitous computing.
  • Possible projects: At the intersection of computer science and social science, my students have built mixed reality games, produced artistic and interactive installations, and studied mixed human-robot teams, among other projects.

David Wentzlaff, Engineering Quadrangle, Room 228

Computing, Operating Systems, Sustainable Computing.

  • Instrument Princeton's Green (HPCRC) data center
  • Investigate power utilization on a processor core implemented in an FPGA
  • Dismantle and document all of the components in modern electronics. Invent new ways to build computers that can be recycled more easily.
  • Other topics in parallel computer architecture or operating systems

Top 7 Software Engineering Trends for 2023

In the fast-paced realm of software engineering, staying up to date with the latest trends is paramount. The landscape is constantly evolving, with new technologies and methodologies redefining the way we approach development, enhancing user experiences, and introducing new possibilities for businesses across industries. And 2023 will be no different. 

Already this year the tech headlines have been dominated by advancements in artificial intelligence, natural language processing, edge computing, and 5G. And these are just a few of the software engineering trends we expect to take shape this year. In this article, we’ll take a deeper look at how these technologies, and others, are evolving and the impact they’ll have on the software engineering landscape in 2023 and beyond.

Artificial Intelligence 

Artificial Intelligence (AI) has become more than just a buzzword; it is now a driving force behind innovation in the field of software engineering. With its ability to simulate human intelligence and automate tasks, AI is transforming the way software is developed, deployed, and used across industries. In 2022, machine learning was the most in-demand technical skill in the world, and in 2023, as AI and ML become even more deeply embedded in software engineering, we expect demand for professionals with these skills to remain high.

One of the key areas where AI is making a significant impact is in automating repetitive tasks. Software engineers can leverage AI-powered tools and frameworks to automate mundane and time-consuming activities, such as code generation, testing, and debugging. This enables developers to focus on higher-level problem-solving and creativity, leading to faster and more efficient development cycles.

AI also plays a crucial role in enhancing decision-making processes. Through machine learning algorithms, software engineers can develop intelligent systems that analyze large datasets, identify patterns, and make predictions. This capability has far-reaching implications, ranging from personalized recommendations in e-commerce platforms to predictive maintenance in manufacturing industries.

Furthermore, AI is revolutionizing user experiences. Natural language processing (NLP) and computer vision are just a couple of AI subfields that enable software engineers to build applications with advanced capabilities. Chatbots that can understand and respond to user queries, image recognition systems that identify objects and faces, and voice assistants that make interactions more intuitive are all examples of AI-powered applications that enrich user experiences.

As AI continues to evolve, its applications are expanding into healthcare, finance, autonomous vehicles, and many other industries. Understanding AI and its potential empowers software engineers to harness its capabilities and drive innovation in their respective fields. 

Kubernetes

As software applications become increasingly complex and distributed, the need for efficient management of containers and microservices has become crucial. This is where Kubernetes, an open-source container orchestration platform, comes into play.

At its core, Kubernetes simplifies the management of containerized applications. Containers allow developers to package applications and their dependencies into portable and isolated units, ensuring consistency across different environments. Kubernetes takes containerization to the next level by automating the deployment, scaling, and management of these containers.

One of the key benefits of Kubernetes is its ability to enable horizontal scaling. By distributing containers across multiple nodes, Kubernetes ensures that applications can handle increasing traffic loads effectively. It automatically adjusts the number of containers based on demand, ensuring optimal utilization of resources.
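The demand-based adjustment described above can be sketched with the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler: the desired replica count is the current count scaled by the ratio of the observed metric to the target metric, rounded up. The function name, the replica bounds, and the sample numbers below are illustrative assumptions; the real controller also applies a tolerance band and stabilization windows that this sketch leaves out.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Return the replica count suggested by the proportional scaling rule."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    # desired = ceil(current * observed / target)
    raw = math.ceil(current_replicas * current_metric / target_metric)
    # Clamp to the configured bounds (assumed values, not defaults of any tool).
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas averaging 90% CPU against a 60% target: scale out.
print(desired_replicas(4, 90.0, 60.0))  # 6
```

When the observed metric falls below the target, the same rule yields a smaller count, which is how scale-in happens.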

Kubernetes also enhances fault tolerance and resilience. If a container or node fails, Kubernetes automatically detects and replaces it, ensuring that applications remain available and responsive. It enables self-healing capabilities, ensuring that the desired state of the application is always maintained.
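The self-healing behavior can be pictured as a reconciliation step that compares the desired state with what is actually running and closes the gap. The following is a toy model, not Kubernetes code: the `reconcile` function and the `web-N` container names are hypothetical, and a real controller watches the cluster through the API server instead of receiving a list.

```python
def reconcile(desired: int, running: list[str], next_id: int) -> tuple[list[str], int]:
    """Return a container list matching the desired count, plus the next unused id."""
    containers = list(running)
    # Start replacements until the desired count is reached.
    while len(containers) < desired:
        containers.append(f"web-{next_id}")
        next_id += 1
    # Scale down by dropping the most recently added containers.
    del containers[desired:]
    return containers, next_id


containers, next_id = reconcile(3, [], 0)   # initial rollout
containers.remove("web-1")                  # simulate a container failure
containers, next_id = reconcile(3, containers, next_id)
print(containers)  # ['web-0', 'web-2', 'web-3']
```

Running the step repeatedly is what keeps the actual state converging on the declared one.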

Furthermore, Kubernetes promotes declarative configuration and infrastructure as code practices. Through the use of YAML-based configuration files, developers can define the desired state of their applications and infrastructure. This allows for reproducibility, version control, and easier collaboration among teams.

As the ecosystem surrounding Kubernetes continues to evolve and become more complex and sophisticated, both adoption of the Kubernetes platform and demand for professionals with Kubernetes experience will continue to grow.

Edge Computing

In the era of rapidly growing data volumes and increasing demand for real-time processing, edge computing has emerged as a crucial software engineering trend that supports cloud optimization and innovation within the IoT space. Edge computing brings computing resources closer to the data source, reducing latency, enhancing performance, and enabling near-instantaneous decision-making.

Traditional cloud computing relies on centralized data centers located far from the end users. In contrast, edge computing pushes computational capabilities to the edge of the network, closer to where the data is generated. This approach is particularly valuable in scenarios where real-time processing and low latency are critical, such as autonomous vehicles, industrial automation, and Internet of Things (IoT) applications.

By processing data at the edge, edge computing minimizes the need for data transmission to the cloud, reducing network congestion and latency. This is especially beneficial in situations where network connectivity is limited, unreliable, or costly. Edge computing enables quicker response times and can support applications that require immediate actions, such as detecting anomalies, triggering alarms, or providing real-time feedback.

One of the key advantages of edge computing is its ability to address privacy and security concerns. With data being processed and analyzed locally, sensitive information can be kept closer to its source, reducing the risk of unauthorized access or data breaches. This is particularly significant in sectors like healthcare and finance, where data privacy and security are paramount.

DevSecOps

According to a report by Cybersecurity Ventures, the global annual cost of cybercrime is expected to reach $8 trillion in 2023. Security is more important than ever, which has led many engineering organizations to reconsider the way they approach and implement security practices. And that’s where DevSecOps comes into play.

DevSecOps, an evolution of the DevOps philosophy, integrates security practices throughout the entire software development lifecycle, ensuring that security is not an afterthought but an integral part of the process. Adoption of this new approach to development continues to gain momentum, with 56% of developers reporting their teams use DevSecOps and DevOps methodologies, up from 47% in 2022.

One of the key benefits of DevSecOps is the ability to identify and mitigate security vulnerabilities early in the development cycle. By conducting security assessments, code reviews, and automated vulnerability scanning, software engineers can identify potential risks and address them proactively. This proactive approach minimizes the likelihood of security breaches and reduces the cost and effort required for remediation later on.

DevSecOps also enables faster and more secure software delivery. By integrating security checks into the continuous integration and continuous deployment (CI/CD) pipeline, software engineers can automate security testing and validation. This ensures that each code change is thoroughly assessed for security vulnerabilities before being deployed to production, reducing the risk of introducing vulnerabilities into the software.

Collaboration is a fundamental aspect of DevSecOps. Software engineers work closely with security teams and operations teams to establish shared responsibilities and ensure that security practices are integrated seamlessly into the development process. This collaborative effort promotes a culture of shared ownership and accountability for security, enabling faster decision-making and more effective risk mitigation.

Progressive Web Applications

In an era where mobile devices dominate our daily lives, progressive web applications (PWAs) have emerged as a significant software engineering trend, with desktop installations of PWAs growing by 270 percent since 2021. PWAs bridge the gap between traditional websites and native mobile applications, offering the best of both worlds. These web applications provide a seamless and immersive user experience while leveraging the capabilities of modern web technologies.

PWAs are designed to be fast, responsive, and reliable, allowing users to access them instantly, regardless of network conditions. Unlike traditional web applications that require a constant internet connection, PWAs can work offline or with a poor network connection. By caching key resources, such as HTML, CSS, and JavaScript files, PWAs ensure that users can access content and perform actions even when they are offline. This enhances the user experience and allows applications to continue functioning seamlessly in challenging network conditions.
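The offline behavior rests on a cache-first lookup: serve a resource from the local cache when it is there, and go to the network only on a miss. Real PWAs implement this in a JavaScript service worker; the Python sketch below only models the logic, and `CacheFirst` and `FakeNetwork` are hypothetical names invented for illustration.

```python
from typing import Callable, Dict


class CacheFirst:
    """Serve from the cache when possible; otherwise fetch and remember."""

    def __init__(self, fetch: Callable[[str], bytes]) -> None:
        self._fetch = fetch
        self._cache: Dict[str, bytes] = {}

    def get(self, url: str) -> bytes:
        if url in self._cache:
            return self._cache[url]
        body = self._fetch(url)  # raises if the network is unavailable
        self._cache[url] = body
        return body


class FakeNetwork:
    """Hypothetical stand-in for the network that can be switched offline."""

    def __init__(self) -> None:
        self.online = True

    def __call__(self, url: str) -> bytes:
        if not self.online:
            raise ConnectionError("offline")
        return f"<contents of {url}>".encode()


network = FakeNetwork()
assets = CacheFirst(network)
assets.get("/index.html")          # first visit: fetched and cached
network.online = False
print(assets.get("/index.html"))   # still served, now from the cache
```

A resource that was never cached still fails while offline, which is why PWAs typically pre-cache their core files at install time.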

One of the key advantages of PWAs is their cross-platform compatibility. Unlike native mobile applications that require separate development efforts for different platforms (e.g., Android and iOS), PWAs are built once and can run on any device with a modern web browser. This significantly reduces development time and costs while expanding the potential user base.

PWAs are also discoverable and shareable. They can be indexed by search engines, making them more visible to users searching for relevant content. Additionally, PWAs can be easily shared via URLs, enabling users to share specific app screens or features with others.

As we venture into 2023, PWAs continue to gain traction, blurring the lines between web and mobile applications. 

Web 3.0

The global Web 3.0 market size stood at $2.2 billion in 2022 and is set to grow by a compound annual growth rate of 44.5 percent, reaching $81.9 billion by 2032. Also known as the Semantic Web, Web 3.0 is an exciting software engineering trend that aims to enhance the capabilities and intelligence of the World Wide Web. Building upon the foundation of Web 2.0, which focused on user-generated content and interactivity, Web 3.0 takes it a step further by enabling machines to understand and process web data, leading to a more intelligent and personalized online experience.

The core concept behind Web 3.0 is the utilization of semantic technologies and artificial intelligence to organize, connect, and extract meaning from vast amounts of web data. This enables computers and applications to not only display information but also comprehend its context and relationships, making the web more intuitive and interactive.

One of the key benefits of Web 3.0 is its ability to provide a more personalized and tailored user experience. By understanding user preferences, behavior, and context, Web 3.0 applications can deliver highly relevant content, recommendations, and services. For example, an e-commerce website powered by Web 3.0 can offer personalized product recommendations based on a user’s browsing history, purchase patterns, and preferences.

Web 3.0 also facilitates the development of intelligent agents and chatbots that can understand and respond to natural language queries, enabling more efficient and interactive user interactions. These intelligent agents can assist with tasks such as customer support, information retrieval, and decision-making.

5G

5G, the fifth generation of wireless technology, is set to revolutionize connectivity and enable a new era of innovation. With its promise of ultra-fast speeds, low latency, and high capacity, 5G opens up a world of possibilities for software engineers, paving the way for advancements in areas such as autonomous vehicles, smart cities, Internet of Things, and immersive experiences. And as mobile networks continue to grow and consumers adopt more 5G devices, more and more companies are investing in the development of applications that take advantage of 5G’s capabilities.

One of the most significant advantages of 5G is its remarkable speed. With download speeds reaching up to 10 gigabits per second, 5G enables lightning-fast data transfer, allowing for real-time streaming, seamless video calls, and rapid file downloads. This enhanced speed unlocks new possibilities for high-bandwidth applications, such as 4K and 8K video streaming, virtual reality, and augmented reality experiences.
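To put the quoted peak rate in perspective, a quick back-of-the-envelope calculation converts a file size into transfer time. The sketch assumes decimal units (1 gigabyte = 8 gigabits), ignores protocol overhead, and treats 10 Gbit/s as a theoretical peak, not a typical real-world speed; the 5 GB file and the 1 Gbit/s comparison link are made-up examples.

```python
def transfer_seconds(size_gigabytes: float, rate_gigabits_per_second: float) -> float:
    """Ideal transfer time for a file, ignoring overhead and congestion."""
    if rate_gigabits_per_second <= 0:
        raise ValueError("rate must be positive")
    size_gigabits = size_gigabytes * 8  # 1 byte = 8 bits
    return size_gigabits / rate_gigabits_per_second


print(transfer_seconds(5, 10))  # 4.0 seconds at the quoted peak rate
print(transfer_seconds(5, 1))   # 40.0 seconds on a 1 Gbit/s link
```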

Low latency is another key feature of 5G. Latency refers to the time it takes for data to travel from one point to another. With 5G, latency is significantly reduced, enabling near-instantaneous communication and response times. This is crucial for applications that require real-time interactions, such as autonomous vehicles that rely on split-second decision-making or remote robotic surgeries where even a slight delay can have serious consequences.

Moreover, 5G has the potential to connect a massive number of devices simultaneously, thanks to its increased capacity. This makes it ideal for powering the Internet of Things (IoT), where billions of devices can seamlessly communicate with each other and the cloud. From smart homes and wearables to industrial sensors and smart grids, 5G’s high capacity enables a truly connected and intelligent ecosystem.

Key Takeaways

As you can see, the software engineering landscape in 2023 will be marked by an exciting array of trends that are shaping the future of technology and innovation. Embracing these software engineering trends allows businesses and software engineers alike to harness their potential and create innovative solutions that meet the evolving needs of users. To learn more about the type of tech professionals and skills needed to build the future of software, check out HackerRank’s roles directory.

This article was written with the help of AI. Can you tell which parts? 


📚 A curated list of papers for Software Engineers

facundoolano/software-papers

Papers for Software Engineers

A curated list of papers that may be of interest to Software Engineering students or professionals. See the sources and selection criteria below.

Von Neumann's First Computer Program. Knuth (1970) . Computer History; Early Programming

  • The Education of a Computer. Hopper (1952) .
  • Recursive Programming. Dijkstra (1960) .
  • Programming Considered as a Human Activity. Dijkstra (1965) .
  • Goto Statement Considered Harmful. Dijkstra (1968) .
  • Program development by stepwise refinement. Wirth (1971) .
  • The Humble Programmer. Dijkstra (1972) .
  • Computer Programming as an Art. Knuth (1974) .
  • The paradigms of programming. Floyd (1979) .
  • Literate Programming. Knuth (1984) .

Computing Machinery and Intelligence. Turing (1950) . Early Artificial Intelligence

  • Some Moral and Technical Consequences of Automation. Wiener (1960) .
  • Steps towards Artificial Intelligence. Minsky (1960) .
  • ELIZA—a computer program for the study of natural language communication between man and machine. Weizenbaum (1966) .
  • A Theory of the Learnable. Valiant (1984) .

A Method for the Construction of Minimum-Redundancy Codes. Huffman (1952) . Information Theory

  • A Universal Algorithm for Sequential Data Compression. Ziv, Lempel (1977) .
  • Fifty Years of Shannon Theory. Verdú (1998) .

Engineering a Sort Function. Bentley, McIlroy (1993) . Data Structures; Algorithms

  • On the Shortest Spanning Subtree of a Graph and the Traveling Salesman Problem. Kruskal (1956) .
  • A Note on Two Problems in Connexion with Graphs. Dijkstra (1959) .
  • Quicksort. Hoare (1962) .
  • Space/Time Trade-offs in Hash Coding with Allowable Errors. Bloom (1970) .
  • The Ubiquitous B-Tree. Comer (1979) .
  • Programming pearls: Algorithm design techniques. Bentley (1984) .
  • Programming pearls: The back of the envelope. Bentley (1984) .
  • Making data structures persistent. Driscoll et al (1986) .

A Design Methodology for Reliable Software Systems. Liskov (1972) . Software Design

  • On the Criteria To Be Used in Decomposing Systems into Modules. Parnas (1971) .
  • Information Distribution Aspects of Design Methodology. Parnas (1972) .
  • Designing Software for Ease of Extension and Contraction. Parnas (1979) .
  • Programming as Theory Building. Naur (1985) .
  • Software Aging. Parnas (1994) .
  • Towards a Theory of Conceptual Design for Software. Jackson (2015) .

Programming with Abstract Data Types. Liskov, Zilles (1974) . Abstract Data Types; Object-Oriented Programming

  • The Smalltalk-76 Programming System Design and Implementation. Ingalls (1978) .
  • A Theory of Type Polymorphism in Programming. Milner (1978) .
  • On understanding types, data abstraction, and polymorphism. Cardelli, Wegner (1985) .
  • SELF: The Power of Simplicity. Ungar, Smith (1991) .

Why Functional Programming Matters. Hughes (1990) . Functional Programming

  • Recursive Functions of Symbolic Expressions and Their Computation by Machine. McCarthy (1960) .
  • The Semantics of Predicate Logic as a Programming Language. Van Emden, Kowalski (1976) .
  • Can Programming Be Liberated from the von Neumann Style? Backus (1978) .
  • The Semantic Elegance of Applicative Languages. Turner (1981) .
  • The essence of functional programming. Wadler (1992) .
  • QuickCheck: A Lightweight Tool for Random Testing of Haskell Programs. Claessen, Hughes (2000) .
  • Church's Thesis and Functional Programming. Turner (2006) .

An Incremental Approach to Compiler Construction. Ghuloum (2006) . Language Design; Compilers

  • The Next 700 Programming Languages. Landin (1966) .
  • Programming pearls: little languages. Bentley (1986) .
  • The Essence of Compiling with Continuations. Flanagan et al (1993) .
  • A Brief History of Just-In-Time. Aycock (2003) .
  • LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation. Lattner, Adve (2004) .
  • A Unified Theory of Garbage Collection. Bacon, Cheng, Rajan (2004) .
  • A Nanopass Framework for Compiler Education. Sarkar, Waddell, Dybvig (2005) .
  • Bringing the Web up to Speed with WebAssembly. Haas (2017) .

No Silver Bullet: Essence and Accidents of Software Engineering. Brooks (1987) . Software Engineering; Project Management

  • How do committees invent? Conway (1968) .
  • Managing the Development of Large Software Systems. Royce (1970) .
  • The Mythical Man Month. Brooks (1975) .
  • On Building Systems That Will Fail. Corbató (1991) .
  • The Cathedral and the Bazaar. Raymond (1998) .
  • Out of the Tar Pit. Moseley, Marks (2006) .

Communicating sequential processes. Hoare (1978) . Concurrency

  • Solution Of a Problem in Concurrent Program Control. Dijkstra (1965) .
  • Monitors: An operating system structuring concept. Hoare (1974) .
  • On the Duality of Operating System Structures. Lauer, Needham (1978) .
  • Software Transactional Memory. Shavit, Touitou (1997) .

The UNIX Time-Sharing System. Ritchie, Thompson (1974). Operating Systems

  • An Experimental Time-Sharing System. Corbató, Merwin Daggett, Daley (1962) .
  • The Structure of the "THE"-Multiprogramming System. Dijkstra (1968) .
  • The nucleus of a multiprogramming system. Hansen (1970) .
  • Reflections on Trusting Trust. Thompson (1984) .
  • The Design and Implementation of a Log-Structured File System. Rosenblum, Ousterhout (1991) .

A Relational Model of Data for Large Shared Data Banks. Codd (1970) . Databases

  • Granularity of Locks and Degrees of Consistency in a Shared Data Base. Gray et al (1975) .
  • Access Path Selection in a Relational Database Management System. Selinger et al (1979) .
  • The Transaction Concept: Virtues and Limitations. Gray (1981) .
  • The design of POSTGRES. Stonebraker, Rowe (1986) .
  • Rules of Thumb in Data Engineering. Gray, Shenoy (1999).

A Protocol for Packet Network Intercommunication. Cerf, Kahn (1974) . Networking

  • Ethernet: Distributed packet switching for local computer networks. Metcalfe, Boggs (1978) .
  • End-To-End Arguments in System Design. Saltzer, Reed, Clark (1984) .
  • An algorithm for distributed computation of a Spanning Tree in an Extended LAN. Perlman (1985) .
  • The Design Philosophy of the DARPA Internet Protocols. Clark (1988) .
  • TOR: The second generation onion router. Dingledine et al (2004) .
  • Why the Internet only just works. Handley (2006) .
  • The Network is Reliable. Bailis, Kingsbury (2014) .

New Directions in Cryptography. Diffie, Hellman (1976) . Cryptography

  • A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Rivest, Shamir, Adleman (1978) .
  • How To Share A Secret. Shamir (1979) .
  • A Digital Signature Based on a Conventional Encryption Function. Merkle (1987) .
  • The Salsa20 family of stream ciphers. Bernstein (2007) .

Time, Clocks, and the Ordering of Events in a Distributed System. Lamport (1978) . Distributed Systems

  • Self-stabilizing systems in spite of distributed control. Dijkstra (1974) .
  • The Byzantine Generals Problem. Lamport, Shostak, Pease (1982) .
  • Impossibility of Distributed Consensus With One Faulty Process. Fischer, Lynch, Paterson (1985).
  • Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial. Schneider (1990) .
  • Practical Byzantine Fault Tolerance. Castro, Liskov (1999) .
  • Paxos made simple. Lamport (2001) .
  • Paxos made live - An Engineering Perspective. Chandra, Griesemer, Redstone (2007) .
  • In Search of an Understandable Consensus Algorithm. Ongaro, Ousterhout (2014) .

Designing for Usability: Key Principles and What Designers Think. Gould, Lewis (1985) . Human-Computer Interaction; User Interfaces

  • As We May Think. Bush (1945) .
  • Man-Computer symbiosis. Licklider (1958) .
  • Some Thoughts About the Social Implications of Accessible Computing. David, Fano (1965) .
  • Tutorials for the First-Time Computer User. Al-Awar, Chapanis, Ford (1981) .
  • The star user interface: an overview. Smith, Irby, Kimball (1982) .
  • Design Principles for Human-Computer Interfaces. Norman (1983) .
  • Human-Computer Interaction: Psychology as a Science of Design. Carroll (1997) .

The anatomy of a large-scale hypertextual Web search engine. Brin, Page (1998) . Information Retrieval; World-Wide Web

  • A Statistical Interpretation of Term Specificity in Retrieval. Spärck Jones (1972) .
  • World-Wide Web: Information Universe. Berners-Lee et al (1992) .
  • The PageRank Citation Ranking: Bringing Order to the Web. Page, Brin, Motwani (1998) .

Dynamo, Amazon’s Highly Available Key-value store. DeCandia et al (2007) . Internet Scale Data Systems

  • The Google File System. Ghemawat, Gobioff, Leung (2003) .
  • MapReduce: Simplified Data Processing on Large Clusters. Dean, Ghemawat (2004) .
  • Bigtable: A Distributed Storage System for Structured Data. Chang et al (2006) .
  • ZooKeeper: wait-free coordination for internet scale systems. Hunt et al (2010) .
  • The Hadoop Distributed File System. Shvachko et al (2010) .
  • Kafka: a Distributed Messaging System for Log Processing. Kreps, Narkhede, Rao (2011) .
  • CAP Twelve Years Later: How the "Rules" Have Changed. Brewer (2012) .
  • Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases. Verbitski et al (2017) .

On Designing and Deploying Internet Scale Services. Hamilton (2007) . Operations; Reliability; Fault-tolerance

  • Ironies of Automation. Bainbridge (1983) .
  • Why do computers stop and what can be done about it? Gray (1985) .
  • Recovery Oriented Computing (ROC): Motivation, Definition, Techniques, and Case Studies. Patterson et al (2002) .
  • Crash-Only Software. Candea, Fox (2003) .
  • Building on Quicksand. Helland, Campbell (2009) .

Thinking Methodically about Performance. Gregg (2012) . Performance

  • Performance Anti-Patterns. Smaalders (2006) .
  • Thinking Clearly about Performance. Millsap (2010) .

Bitcoin, A peer-to-peer electronic cash system. Nakamoto (2008). Cryptocurrencies

  • Ethereum: A Next-Generation Smart Contract and Decentralized Application Platform. Buterin (2014) .

A Few Useful Things to Know About Machine Learning. Domingos (2012) . Machine Learning

  • Statistical Modeling: The Two Cultures. Breiman (2001) .
  • The Unreasonable Effectiveness of Data. Halevy, Norvig, Pereira (2009) .
  • ImageNet Classification with Deep Convolutional Neural Networks. Krizhevsky, Sutskever, Hinton (2012) .
  • Playing Atari with Deep Reinforcement Learning. Mnih et al (2013) .
  • Generative Adversarial Nets. Goodfellow et al (2014) .
  • Deep Learning. LeCun, Bengio, Hinton (2015) .
  • Attention Is All You Need. Vaswani et al (2017) .

This list was inspired by (and draws from) several books and paper collections:

  • Papers We Love
  • Ideas That Created the Future
  • The Innovators
  • The morning paper
  • Distributed systems for fun and profit
  • Readings in Database Systems (the Red Book)
  • Fermat's Library
  • Classics in Human-Computer Interaction
  • Awesome Compilers
  • Distributed Consensus Reading List
  • The Decade of Deep Learning

A few interesting resources about reading papers from Papers We Love and elsewhere:

  • Should I read papers?
  • How to Read an Academic Article
  • How to Read a Paper. Keshav (2007) .
  • Efficient Reading of Papers in Science and Technology. Hanson (1999) .
  • On ICSE’s “Most Influential Papers”. Parnas (1995) .

Selection criteria

  • The idea is not to include every interesting paper that I come across but rather to keep a representative list that's possible to read from start to finish with a similar level of effort as reading a technical book from cover to cover.
  • I tried to include one paper per each major topic and author. Since in the process I found a lot of noteworthy alternatives, related or follow-up papers and I wanted to keep track of those as well, I included them as sublist items.
  • The papers shouldn't be too long. For the same reasons as the previous item, I try to avoid papers longer than 20 or 30 pages.
  • They should be self-contained and readable enough to be approachable by the casual technical reader.
  • They should be freely available online.
  • Examples of this are classic works by Von Neumann, Turing and Shannon.
  • That being said, where possible I preferred the original paper on each subject over modern updates or survey papers.
  • Similarly, I tended to skip more theoretical papers, those focusing on mathematical foundations for Computer Science, electronic aspects of hardware, etc.
  • I sorted the list by a mix of relatedness of topics and a vague chronological relevance, such that it makes sense to read it in the suggested order. For example, historical and seminal topics go first, contemporary internet-era developments last, networking precedes distributed systems, etc.


Research Topics in Software Engineering


This seminar is an opportunity to become familiar with current research in software engineering and more generally with the methods and challenges of scientific research.

Each student will be asked to study some papers from the recent software engineering literature and review them. This is an exercise in critical review and analysis. Active participation is required (a presentation of a paper as well as participation in discussions).

The aim of this seminar is to introduce students to recent research results in the area of programming languages and software engineering. To accomplish that, students will study and present research papers in the area as well as participate in paper discussions. The papers will span topics in both theory and practice, including papers on program verification, program analysis, testing, programming language design, and development tools.

Topic modeling in software engineering research

  • Open access
  • Published: 06 September 2021
  • Volume 26, article number 120 (2021)


  • Camila Costa Silva (ORCID: orcid.org/0000-0002-3690-1711),
  • Matthias Galster (ORCID: orcid.org/0000-0003-3491-1833) &
  • Fabian Gilson (ORCID: orcid.org/0000-0002-1465-3315)


Topic modeling using models such as Latent Dirichlet Allocation (LDA) is a text mining technique to extract human-readable semantic “topics” (i.e., word clusters) from a corpus of textual documents. In software engineering, topic modeling has been used to analyze textual data in empirical studies (e.g., to find out what developers talk about online), but also to build new techniques to support software engineering tasks (e.g., to support source code comprehension). Topic modeling needs to be applied carefully (e.g., depending on the type of textual data analyzed and modeling parameters). Our study aims at describing how topic modeling has been applied in software engineering research with a focus on four aspects: (1) which topic models and modeling techniques have been applied, (2) which textual inputs have been used for topic modeling, (3) how textual data was “prepared” (i.e., pre-processed) for topic modeling, and (4) how generated topics (i.e., word clusters) were named to give them a human-understandable meaning. We analyzed topic modeling as applied in 111 papers from ten highly-ranked software engineering venues (five journals and five conferences) published between 2009 and 2020. We found that (1) LDA and LDA-based techniques are the most frequent topic modeling techniques, (2) developer communication and bug reports have been modeled most, (3) data pre-processing and modeling parameters vary considerably and are often vaguely reported, and (4) manual topic naming (such as deducing names based on frequent words in a topic) is common.


1 Introduction

Text mining is about searching, extracting and processing text to provide meaningful insights from the text based on a certain goal. Techniques for text mining include natural language processing (NLP) to process, search and understand the structure of text (e.g., part-of-speech tagging), web mining to discover information resources on the web (e.g., web crawling), and information extraction to extract structured information from unstructured text and relationships between pieces of information (e.g., co-reference, entity extraction) (Miner et al. 2012). Text mining has been widely used in software engineering research (Bi et al. 2018), for example, to uncover architectural design decisions in developer communication (Soliman et al. 2016) or to link software artifacts to source code (Asuncion et al. 2010).

Topic modeling is a text mining and concept extraction method that extracts topics (i.e., coherent word clusters) from large corpora of textual documents to discover hidden semantic structures in text (Miner et al. 2012). An advantage of topic modeling over other techniques is that it helps analyze long texts (Treude and Wagner 2019; Miner et al. 2012), creates clusters as “topics” (rather than individual words) and is unsupervised (Miner et al. 2012).

Topic modeling has become popular in software engineering research (Sun et al. 2016; Chen et al. 2016). For example, Sun et al. (2016) found that topic modeling had been used to support source code comprehension, feature location and defect prediction. Additionally, Chen et al. (2016) found that many repository mining studies apply topic modeling to textual data such as source code and log messages to recommend code refactoring (Bavota et al. 2014b) or to localize bugs (Lukins et al. 2010).

Probabilistic topic models such as Latent Semantic Indexing (LSI) (Deerwester et al. 1990) and Latent Dirichlet Allocation (LDA) (Blei et al. 2003b) discover topics in a corpus of textual documents, using the statistical properties of word frequencies and co-occurrences (Lin et al. 2014). However, Agrawal et al. (2018) warn about systematic errors in the analysis of LDA topic models that limit the validity of topics. Lin et al. (2014) also advise that classical topic models usually generate sub-optimal topics when applied “as is” to small corpora or short text documents.

Considering the limitations of topic modeling techniques and topic models on the one hand and their potential usefulness in software engineering on the other hand, our goal is to describe how topic modeling has been applied in software engineering research. In detail, we explore the following research questions:

RQ1. Which topic modeling techniques have been used and for what purpose? There are different topic modeling techniques (see Section 2), each with their own limitations and constraints (Chen et al. 2016). This RQ aims at understanding which topic modeling techniques have been used (e.g., LDA, LSI) and for what purpose studies applied such techniques (e.g., to support software maintenance tasks). Furthermore, we analyze the types of contributions in studies that used topic modeling (e.g., a new approach as a solution proposal, or an exploratory study).

RQ2. What are the inputs into topic modeling? Topic modeling techniques accept different types of textual documents and require the configuration of parameters (see Section 2.1). Carefully choosing parameters (such as the number of topics to be generated) is essential for obtaining valuable and reliable topics (Agrawal et al. 2018; Treude and Wagner 2019). This RQ aims at analyzing types of textual data (e.g., source code), actual documents (e.g., a Java class or an individual Java method) and configured parameters used for topic modeling to address software engineering problems.

RQ3. How are data pre-processed for topic modeling? Topic modeling requires that the analyzed text is pre-processed (e.g., by removing stop words) to improve the quality of the produced output (Aggarwal and Zhai 2012; Bi et al. 2018). This RQ aims at analyzing how previous studies pre-processed textual data for topic modeling, including the steps for cleaning and transforming text. This will help us understand if there are specific pre-processing steps for a certain topic modeling technique or types of textual data.

RQ4. How are generated topics named? This RQ aims at analyzing if and how topics (word clusters) were named in studies. Giving meaningful names to topics may be difficult but may be required to help humans comprehend topics. For example, naming topics can provide a high-level view on topics discussed by developers in Stack Overflow (a Q&A website) (Barua et al. 2014) or by end users of mobile apps in tweets (Mezouar et al. 2018). Analysts (e.g., developers interested in what topics are discussed on Stack Overflow or app reviews) can then look at the name of the topic (i.e., its “label”) rather than the cluster of words. These labels or names must capture the overarching meaning of all words in a topic. We describe different approaches to naming topics generated by a topic model, such as manual or automated labeling of clusters with names based on the most frequent words of a topic (Hindle et al. 2013).

In this paper, we provide an overview of the use of topic modeling in 111 papers published between 2009 and 2020 in highly ranked venues of software engineering (five journals and five conferences). We identify characteristics and limitations in the use of topic models and discuss (a) the appropriateness of topic modeling techniques, (b) the importance of pre-processing, (c) challenges related to defining meaningful topics, and (d) the importance of context when manually naming topics.

The rest of the paper is organized as follows. In Section 2 we provide an overview of topic modeling. In Section 3 we describe other literature reviews on the topic as well as “meta-studies” that discuss topic modeling more generally. We describe the research method in Section 4 and present the results in Section 5. In Section 6, we summarize our findings and discuss implications and threats to validity. Finally, in Section 7 we present concluding remarks and future work.

2 Topic Modeling

Topic modeling aims at automatically finding topics, typically represented as clusters of words, in a given textual document (Bi et al. 2018). Unlike (supervised) machine learning-based techniques that solve classification problems, topic modeling does not use tags, training data or predefined taxonomies of concepts (Bi et al. 2018). Based on the frequencies of words and frequencies of co-occurrence of words within one or more documents, topic modeling clusters words that are often used together (Barua et al. 2014; Treude and Wagner 2019). Figure 1 illustrates the general process of topic modeling, from a raw corpus of documents (“Data input”) to topics generated for these documents (“Output”). Below we briefly introduce the basic concepts and terminology of topic modeling (based on Chen et al. (2016)):

Word w: a string of one or more alphanumeric characters (e.g., “software” or “management”);

Document d: a set of n words (e.g., a text snippet with five words: \(w_{1}\) to \(w_{5}\));

Corpus C: a set of t documents (e.g., nine text snippets: \(d_{1}\) to \(d_{9}\));

Vocabulary V: a set of m unique words that appear in a corpus (e.g., m = 80 unique words across nine documents);

Term-document matrix A: an m by t matrix whose entry \(A_{i,j}\) is the weight (according to some weighting function, such as term frequency) of word \(w_{i}\) in document \(d_{j}\). For example, given a matrix A with three words and three documents as

[Matrix A: image not reproduced]

\(A_{1,1} = 5\) indicates that “code” appears five times in \(d_{1}\), etc.;

Topic z: a collection of terms that co-occur frequently in the documents of a corpus. Considering probabilistic topic models (e.g., LDA), z refers to an m-length vector of probabilities over the vocabulary of a corpus. For example, in a vector \(z_{1} = (code: 0.35; test: 0.17; bug: 0.08)\),

0.35 indicates that when a word is picked from a topic \(z_{1}\), there is a 35% chance of drawing the word “code”, etc.;

Topic-term matrix \(\phi\) (or T): a k by m matrix with k as the number of topics and \(\phi_{i,j}\) the probability of word \(w_{j}\) in topic \(z_{i}\). Row i of \(\phi\) corresponds to \(z_{i}\). For example, given a matrix \(\phi\) as

[Matrix \(\phi\): image not reproduced]

0.05 in the first column indicates that the word “code” appears with a probability of 5% in topic \(z_{3}\), etc.;

Topic membership vector \(\theta_{d}\): for document \(d_{i}\), a k-length vector of probabilities of the k topics. For example, given a vector \(\theta_{d_{i}} = (z_{1}: 0.25; z_{2}: 0.10; z_{3}: 0.08)\),

0.25 indicates that there is a 25% chance of selecting topic \(z_{1}\) in \(d_{i}\);

Document-topic matrix \(\theta\) (or D): a t by k matrix (one row per document in the corpus) with \(\theta_{i,j}\) as the probability of topic \(z_{j}\) in document \(d_{i}\). Row i of \(\theta\) corresponds to \(\theta_{d_{i}}\). For example, given a matrix \(\theta\) as

[Matrix \(\theta\): image not reproduced]

0.10 in the first column indicates that document \(d_{2}\) contains topic \(z_{1}\) with probability of 10%, etc.
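To make the terminology above concrete, the following is a minimal sketch (not taken from the paper) that builds a vocabulary and a term-frequency term-document matrix from a tiny made-up corpus. It also shows how, in a probabilistic topic model, the probability of a word in a document can be read as a mixture over topics. The function names, the corpus, and the values of `phi` and `theta_d` are all hypothetical.

```python
from collections import Counter


def build_term_document_matrix(corpus):
    """Return (vocabulary, A), where A[i][j] counts word i in document j."""
    tokenized = [doc.lower().split() for doc in corpus]
    # Vocabulary V: the sorted set of unique words in the corpus.
    vocabulary = sorted({word for doc in tokenized for word in doc})
    counts = [Counter(doc) for doc in tokenized]
    # One row per word (m rows), one column per document (t columns).
    matrix = [[c[word] for c in counts] for word in vocabulary]
    return vocabulary, matrix


def word_probability(word_index, theta_d, phi):
    """P(word | document) as a mixture: sum_k theta_d[k] * phi[k][word_index]."""
    return sum(weight * topic[word_index] for weight, topic in zip(theta_d, phi))


# Hypothetical corpus of three short documents.
corpus = ["code test code bug", "test test bug", "code code code"]
vocabulary, A = build_term_document_matrix(corpus)

# Hypothetical topic-term matrix (k = 2 topics over the 3-word vocabulary)
# and a hypothetical topic membership vector for one document.
phi = [[0.1, 0.7, 0.2], [0.5, 0.1, 0.4]]
theta_d = [0.75, 0.25]
p_code = word_probability(vocabulary.index("code"), theta_d, phi)
```

Here term frequency is used as the weighting function; other weightings would only change how the entries of `A` are computed.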

Fig. 1 General topic modeling process

2.1 Data Input

Data used as input into topic modeling can take many forms. This requires decisions on what exactly are documents and what the scope of individual documents is (Miner et al. 2012 ). Therefore, we need to determine which unit of text shall be analyzed (e.g., subject lines of e-mails from a mailing list or the body of e-mails).

To model topics from raw text in a corpus C (see Fig. 1), the data needs to be converted into a structured vector-space model, such as the term-document matrix A. This typically also requires some pre-processing. Although each text mining approach (including topic modeling) may require specific pre-processing steps, there are some common steps, such as tokenization, stemming and removing stop words (Miner et al. 2012). We discuss pre-processing for topic modeling in more detail when presenting the results for RQ3 in Section 5.4.
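The common pre-processing steps named above (tokenization, stop-word removal, stemming) can be sketched as follows. This is only an illustration: the stop-word list is a tiny hypothetical sample, and `naive_stem` is a deliberately crude suffix stripper, not the Porter stemmer or any stemmer used in the surveyed studies.

```python
import re

# Hypothetical, deliberately small stop-word list for illustration.
STOP_WORDS = {"the", "a", "an", "is", "are", "to", "of", "and", "in", "it"}
# Suffixes removed by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lower-case the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip one known suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then stem the remaining tokens."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


tokens = preprocess("The tests are failing in the debugging module")
```

With this toy stemmer, "debugging" becomes "debugg", which shows why studies usually rely on an established stemming algorithm and report which one they used.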

2.2 Modeling

Different models can be used for topic modeling. Models typically differ in how they model topics and in their underlying assumptions. For example, besides LDA and LSI mentioned before, other topic modeling techniques include Probabilistic Latent Semantic Indexing (pLSI) (Hofmann 1999). LSI and pLSI reduce the dimensionality of A using Singular Value Decomposition (SVD) (Hofmann 1999). Furthermore, variants of LDA have been proposed, such as Relational Topic Models (RTM) (Chang and Blei 2010) and Hierarchical Topic Models (HLDA) (Blei et al. 2003a). RTM finds relationships between documents based on the generated topics (e.g., if document \(d_{1}\) contains the topic “microservices”, document \(d_{2}\) contains the topic “containers” and document \(d_{n}\) contains the topic “user interface”, RTM will find a link between documents \(d_{1}\) and \(d_{2}\) (Chang and Blei 2010)). HLDA discovers a hierarchy of topics within a corpus, where each lower level in the hierarchy is more specific than the previous one (e.g., a higher topic “web development” may have subtopics such as “front-end” and “back-end”).
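As a rough illustration of the dimensionality reduction that LSI performs on A, the sketch below computes only the leading singular triplet of a small matrix by power iteration and uses it to form a rank-1 approximation. This is the simplest (k = 1) case of a truncated SVD, written in plain Python for readability; a real analysis would use a linear algebra library, and the matrix and function names here are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def normalize(vector):
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def leading_singular_triplet(A, iterations=100):
    """Approximate (u, sigma, v) for the largest singular value of A.

    Assumes A is non-zero and that the all-ones start vector is not
    orthogonal to the leading right singular vector.
    """
    At = transpose(A)
    v = normalize([1.0] * len(A[0]))
    for _ in range(iterations):
        # Power iteration on A^T A converges to the top right singular vector.
        v = normalize(matvec(At, matvec(A, v)))
    Av = matvec(A, v)
    sigma = math.sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return u, sigma, v


def rank_one_approximation(u, sigma, v):
    """Reconstruct sigma * u * v^T, the best rank-1 approximation of A."""
    return [[sigma * ui * vj for vj in v] for ui in u]


# Hypothetical term-document matrix (3 words by 3 documents).
A = [[2.0, 0.0, 3.0], [1.0, 2.0, 0.0], [1.0, 1.0, 0.0]]
u, sigma, v = leading_singular_triplet(A)
A1 = rank_one_approximation(u, sigma, v)
```

In this sketch, `u` weights the words and `v` weights the documents along a single latent dimension; keeping more singular triplets would give a higher-rank approximation.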

Topic modeling techniques need to be configured for a specific problem, objectives and characteristics of the analyzed text (Treude and Wagner 2019; Agrawal et al. 2018). For example, Treude and Wagner (2019) studied parameters, characteristics of text corpora and how the characteristics of a corpus impact the development of a topic modeling technique using LDA. Treude and Wagner (2019) found that textual data from Stack Overflow (e.g., threads of questions and answers) and GitHub (e.g., README files) require different configurations for the number of generated topics (k). Similarly, Barua et al. (2014) argued that the number of topics depends on the characteristics of the analyzed corpora. Furthermore, the values of modeling parameters (e.g., LDA’s hyperparameters α and β which control an initial topic distribution) can also be adjusted depending on the corpus to improve the quality of topics (Agrawal et al. 2018).
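To give a feel for why such hyperparameters matter, the sketch below draws topic distributions from a symmetric Dirichlet prior, which is the kind of prior LDA places on per-document topic distributions (controlled by α) and on per-topic word distributions (controlled by β). The function name, the seed, and the chosen values of α are assumptions for illustration only.

```python
import random


def sample_dirichlet(alpha, k, rng):
    """Draw one k-dimensional sample from a symmetric Dirichlet(alpha).

    Uses the standard construction: normalize k independent Gamma(alpha, 1) draws.
    """
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
# Small alpha: most of the probability mass tends to fall on few topics.
sparse = sample_dirichlet(0.1, 5, rng)
# Large alpha: probability mass tends to spread evenly across topics.
smooth = sample_dirichlet(10.0, 5, rng)
```

Comparing `sparse` and `smooth` over repeated draws suggests how a low α encodes the assumption that each document covers only a few topics, while a high α assumes documents mix many topics.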

By finding words that are often used together in documents in a corpus, a topic modeling technique creates clusters of words or topics \(z_{k}\). Words in such a cluster are usually related in some way, therefore giving the topic a meaning. For example, we can use a topic modeling technique to extract five topics from unstructured documents such as a collection of Stack Overflow posts. One of the clusters generated could include the co-occurring words “error”, “debug” and “warn”. We can then manually inspect this cluster and by inference suggest the label “Exceptions” to name this topic (Barua et al. 2014).
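A common starting point for naming a topic is to look at its most probable words. The sketch below, with a hypothetical topic and made-up probabilities, simply ranks the words of one topic so that a human can inspect them and propose a label.

```python
def top_words(topic, n=3):
    """Return the n most probable words of a topic (a word -> probability map)."""
    ranked = sorted(topic.items(), key=lambda item: item[1], reverse=True)
    return [word for word, _ in ranked[:n]]


# Hypothetical topic; the probabilities are invented for illustration.
topic = {"error": 0.30, "debug": 0.22, "warn": 0.15, "file": 0.05, "user": 0.03}
candidate_label_terms = top_words(topic)
```

The ranking only surfaces candidate terms; choosing a label such as “Exceptions” remains a manual judgment.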

3 Related Work

3.1 Previous Literature Reviews

Sun et al. ( 2016 ) and Chen et al. ( 2016 ), similar to our study, surveyed software engineering papers that applied topic modeling. Table  1 shows a comparison between our study and prior reviews. As shown in the table, Sun et al. ( 2016 ) focused on finding which software engineering tasks have been supported by topic models (e.g., support source code comprehension, feature location, traceability link recovery, refactoring, software testing, developer recommendations, software defects prediction and software history comprehension), and Chen et al. ( 2016 ) focused on characterizing how studies used topic modeling to mine software repositories.

Furthermore, as shown in Table  1 , in comparison to Sun et al. ( 2016 ) and Chen et al. ( 2016 ), our study surveys the literature considering other aspects of topic modeling such as data inputs (RQ2), data pre-processing (RQ3), and topic naming (RQ4). Additionally, we searched for papers that applied topic models to any type of data (e.g., Q&A websites) rather than to data in software repositories. We also applied a different search process to identify relevant papers.

Although some of the search venues of these two previous studies and our study overlap, our search focused on specific venues. We also searched papers published between 2009 and 2020, a period which only partially overlaps with the searches presented by Sun et al. ( 2016 ) and Chen et al. ( 2016 ).

Regarding the data analyzed in previous studies, Chen et al. ( 2016 ) analyzed two aspects not covered in our study: (a) tools to implement topic models in papers, and (b) how papers evaluated topic models (note that even though we did not cover this aspect explicitly, we checked whether papers compared different topic models, and if so, what metrics they used to compare topic models). However, in contrast to Chen et al. ( 2016 ), we analyzed (a) the types of contribution of papers (e.g., a new approach); (b) details about the types of data and documents used in topic modeling techniques, and (c) whether and how topics were named. Additionally, we extend the survey of Chen et al. ( 2016 ) by investigating hyperparameters (see Section  2.1 ) of topic models and data pre-processing in more detail. We provide more details and a justification of our research method in Section  4 .

3.2 Meta-studies on Topic Modeling

In addition to literature surveys, there are “meta-studies” on topic modeling that address and reflect on different aspects of topic modeling more generally (and are not considered primary studies for the purpose of our review, see our inclusion and exclusion criteria in Section  4 ). In the following paragraphs we organized their discussion into three parts: (1) studies about parameters for topic modeling, (2) studies on topic models based on the type of analyzed data, and (3) studies about metrics and procedures to evaluate the performance of topic models. We refer to these studies throughout this manuscript when reflecting on the findings of our study.

Regarding parameters used for topic modeling, Treude and Wagner ( 2019 ) performed a broad study on LDA parameters to find optimal settings when analyzing GitHub and Stack Overflow text corpora. The authors found that popular rules of thumb for topic modeling parameter configuration were not applicable to their corpora, which required different configurations to achieve good model fit. They also found that it is possible to predict good configurations for unseen corpora reliably. Agrawal et al. ( 2018 ) also performed experiments on LDA parameter configurations and proposed LDADE, a tool to tune the LDA parameters. The authors found that due to LDA topic model instability, using standard LDA with “off-the-shelf” settings is not advisable. We also discuss parameters for topic modeling in Section  2.2 .

For studies on topic models based on the analyzed data, researchers have investigated topic modeling involving short texts (e.g., a tweet) and how to improve the performance of topic models that work well with longer text (e.g., a book chapter) (Lin et al. 2014 ). For example, the study of Jipeng et al. ( 2020 ) compared short-text topic modeling techniques and developed an open-source library of the short-text models. Another example is the work of Mahmoud and Bradshaw ( 2017 ) who discussed topic modeling techniques specific for source code.

Finally, regarding metrics and procedures to evaluate the performance of topic models, some works have explored how semantically meaningful topics are for humans (Chang et al. 2009 ). For example, Poursabzi-Sangdeh et al. ( 2021 ) discuss the importance of interpretability of models in general (also considering other text mining techniques). Another example is the work of Chang et al. ( 2009 ) who presented a method for measuring the interpretability of a topic model based on how well words within topics are related and how different topics are from each other. On the other hand, as an effort to quantify the interpretability of topics without human evaluation, some studies developed topic coherence metrics. These metrics score the probability of a pair of words from topics being found together in (a) external data sources (e.g., Wikipedia pages) or (b) in the documents used by the model that generated those topics (Röder et al. 2015 ). Röder et al. ( 2015 ) combined different implementations of coherence metrics in a framework. Perplexity is another measure of performance for statistical models in natural language processing, which indicates the uncertainty in predicting a single word (Blei et al. 2003b ). This metric is often applied to compare the configurations of a topic modeling technique (e.g., Zhao et al. ( 2020 )). Other studies use perplexity as an indicator of model quality (such as Chen et al. 2019 and Yan et al. 2016b ).
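To make the idea of a document-based coherence metric concrete, the following sketch scores a topic by how often pairs of its top words appear together in the modeled documents. It follows the general shape of the measure commonly referred to as UMass coherence (a sum of log((D(w_m, w_l) + 1) / D(w_l)) over word pairs) and is only an assumption about one possible metric, not the framework of Röder et al. ( 2015 ). The corpus and the function name `umass_coherence` are hypothetical.

```python
import math


def umass_coherence(top_words, documents):
    """Sum log((D(w_m, w_l) + 1) / D(w_l)) over ordered pairs of top words.

    D(...) is the number of documents containing all of the given words.
    Higher (less negative) values indicate words that co-occur more often.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for doc in doc_sets if all(w in doc for w in words))

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            d_l = doc_freq(top_words[l])
            if d_l == 0:
                continue  # skip words that never occur in the corpus
            d_ml = doc_freq(top_words[m], top_words[l])
            score += math.log((d_ml + 1) / d_l)
    return score


# Hypothetical toy corpus of tokenized documents.
docs = [
    ["error", "debug", "warn", "stack"],
    ["error", "debug", "log"],
    ["button", "layout", "view"],
    ["warn", "error", "debug"],
    ["layout", "view", "css"],
]

coherent = umass_coherence(["error", "debug", "warn"], docs)
mixed = umass_coherence(["error", "layout", "css"], docs)
```

With this toy data, the topic whose words share documents receives a higher score than the topic that mixes unrelated words.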

4 Research Method

We conducted a literature survey to describe how topic modeling has been applied in software engineering research. To answer the research questions introduced in Section  1 , we followed general guidelines for systematic literature review (Kitchenham 2004 ) and mapping study methods (Petersen et al. 2015 ). This was to systematically identify relevant works, and to ensure traceability of our findings as well as the repeatability of our study. However, we do not claim to present a fully-fledged systematic literature review (e.g., we did not assess the quality of primary studies) or a mapping study (e.g., we only analyzed papers from carefully selected venues). Furthermore, we used parts of the procedures from other literature surveys on similar topics (Bi et al. 2018 ; Chen et al. 2016 ; Sun et al. 2016 ) as discussed throughout this section.

4.1 Search Procedure

To identify relevant research, we selected high-quality software engineering publication venues. This was to ensure that our literature survey includes studies of high quality and described at a sufficient level of detail. We identified venues rated as A and A* for Computer Science and Information Systems research in the Excellence Research for Australia (CORE) ranking (ARC 2012 ). Only one journal was rated B (IST), but we included it due to its relevance for software engineering research. These venues are a subset of venues also searched by related previous literature surveys (Chen et al. 2016 ; Sun et al. 2016 ), see Section  3 . The list of searched venues includes five journals: (1) Empirical Software Engineering (EMSE); (2) Information and Software Technology (IST); (3) Journal of Systems and Software (JSS); (4) ACM Transactions on Software Engineering & Methodology (TOSEM); (5) IEEE Transactions on Software Engineering (TSE). Furthermore, we included five conferences: (1) International Conference on Automated Software Engineering (ASE); (2) ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM); (3) International Symposium on the Foundations of Software Engineering / European Software Engineering Conference (ESEC/FSE); (4) International Conference on Software Engineering (ICSE); (5) International Workshop/Working Conference on Mining Software Repositories (MSR).

We performed a generic search on SpringerLink (EMSE), Science Direct (IST, JSS), ACM DL (TOSEM, ESEC/FSE, ASE, ESEM, ICSE, MSR) and IEEE Xplore (TSE, ASE, ESEM, ICSE, MSR) using the venue (journal or conference) as a high-level filtering criterion. Considering that the proceedings of ASE, ESEM, ICSE, and MSR are published by ACM and IEEE, we searched these venues on both ACM DL and IEEE Xplore to avoid missing relevant papers. We used a generic search string (“topic model[l]ing” and “topic model”). Furthermore, in order to find studies that apply specific topic models but do not mention the term “topic model”, we used a second search string with topic model names (“lsi” or “lda” or “plsi” or “latent dirichlet allocation” or “latent semantic”). This second string was based on the search string used by Chen et al. ( 2016 ), who also present a review and analysis of topic modeling techniques in software engineering (see Section  3 ). We applied both strings to the full text and metadata of papers. We considered works published between 2009 and 2020. The search was performed in March 2021. Limiting the search to the last twelve years allowed us to focus on more mature and recent works.
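The two search strings can be approximated locally with regular expressions, for example to re-check downloaded full texts. The sketch below is an assumption about how such a check could look. It does not reproduce the query syntax of any digital library, and the names `GENERIC`, `SPECIFIC` and `match_search_strings` are hypothetical.

```python
import re

# First string: "topic model", "topic modeling" or "topic modelling".
# No trailing word boundary, so plural forms such as "topic models" also match.
GENERIC = re.compile(r"\btopic\s+model(?:l?ing)?", re.IGNORECASE)

# Second string: names of specific topic models.
SPECIFIC = re.compile(
    r"\b(?:lsi|lda|plsi|latent\s+dirichlet\s+allocation|latent\s+semantic)\b",
    re.IGNORECASE,
)


def match_search_strings(text):
    """Return which of the two search strings match the given text."""
    return {
        "generic": GENERIC.search(text) is not None,
        "specific": SPECIFIC.search(text) is not None,
    }


hits = match_search_strings("We apply an LDA-based approach to bug reports.")
```

A paper that only mentions "LDA" is found by the second string but not the first, which is the case the second search string was meant to cover.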

4.2 Study Selection Criteria

We only considered full research papers since full papers typically report (a) mature and complete research, and (b) more details about how topic modeling was applied. Furthermore, to be included, a paper should either apply, experiment with, or propose a topic modeling technique (e.g., develop a topic modeling technique that analyzes source code to recommend refactorings (Bavota et al. 2014b )), and meet none of the exclusion criteria: (a) the paper does not apply topic models (e.g., it applies other text mining techniques and only cites topic modeling in related or future work, such as the paper by Lian et al. ( 2020 )); (b) the paper focuses on theoretical foundation and configurations for topic models (e.g., it discusses how to tune and stabilize topic models, such as Agrawal et al. ( 2018 ) and other meta-studies listed in Section  3.2 ); and (c) the paper is a secondary study (e.g., a literature review like the studies discussed in Section  3.1 ). We evaluated inclusion and exclusion criteria by first reading the abstracts and then reading full texts.

The search with the first search string (see Section  4.1 ) resulted in 215 papers and the search with the second search string resulted in an additional 324 papers. Applying the filtering outlined above resulted in 114 papers. Furthermore, we excluded three papers from the final set of papers: (a) Hindle et al. ( 2011 ), (b) Chen et al. ( 2012 ), and (c) Alipour et al. ( 2013 ). These papers were earlier and shorter versions of follow-up publications; we considered only the latest publications of these papers (Hindle et al. 2013 ; Chen et al. 2017 ; Hindle et al. 2016 ). This resulted in a total of 111 papers for analysis.

4.3 Data Extraction and Synthesis

We defined data items to answer the research questions and characterize the selected papers (see Table  2 ). The extracted data was recorded in a spreadsheet for analysis (raw data are available online Footnote 1 ). One of the authors extracted the data and the other authors reviewed it. In case of ambiguous data, all authors discussed to reach agreement. To synthesize the data, we applied descriptive statistics and qualitatively analyzed the data as follows:

RQ1: Regarding the data item “Technique”, we identified the topic modeling techniques applied in papers. For the data item “Supported tasks”, we assigned to each paper one software engineering task. Tasks emerged during the analysis of papers (see more details in Section  5.2.2 ). We also identified the general study outcome in relation to its goal (data item “Type of contribution”). When analyzing the type of contribution, we also checked whether papers included a comparison of topic modeling techniques (e.g., to select the best technique to be included in a newly proposed approach). Based on these data items we checked which techniques were the most popular, whether techniques were based on other techniques or used together, and for what purpose topic modeling was used.

RQ2: We identified types of data (data item “Type of data”) in selected papers as listed in Section  5.3.1 . Considering that some papers addressed one, two or three different types of data, we counted the frequency of types of data and related them with the document. Regarding “Document”, we identified the textual document and (if reported in the paper) its length. For the data item “Parameters”, we identified whether papers described modeling parameters and if so, which values were assigned to them.

RQ3: Considering that some papers may have not mentioned any pre-processing, we first checked which papers described data pre-processing. Then, we listed all pre-processing steps found and counted their frequencies.

RQ4: Considering the papers that described topic naming, we analyzed how generated topics were named (see Section  5.5 ). We used three types of approaches to describe how topics were named: (a) Manual - manual analysis and labeling of topics; (b) Automated - use of automated approaches to assign names to topics; and (c) Manual & Automated - a mix of both manual and automated approaches to analyze and name topics. We also described the procedures performed to name topics.

5.1 Overview

As mentioned in Section  4.1 , we analyzed 111 papers published between 2009 and 2020 (see Appendix  A.1 - Papers Reviewed). Most papers were published after 2013. Furthermore, most papers were published in journals (68 papers in total, 32 in EMSE alone), while the remaining 43 papers appeared in conferences (mostly MSR with sixteen papers). Table  3 shows the number of papers by venue and year.

5.2 RQ1: Topic Models Used

In this Section we first discuss which topic modeling techniques are used (Section  5.2.1 ). Then, we explore why or for what purpose these techniques were used (Section  5.2.2 ). Finally, we describe the general contributions of papers in relation to their goals (Section  5.2.3 ).

5.2.1 Topic Modeling Techniques

The majority of the papers used LDA (80 out of 111), or an LDA-based technique (30 out of 111), such as Twitter-LDA (Zhao et al. 2011 ). The other topic modeling technique used is LSI. Figure  2 shows the number of papers per topic modeling technique. The total number (125) exceeds the number of papers reviewed (111), because ten papers experimented with more than one technique: Thomas et al. ( 2013 ), De Lucia et al. ( 2014 ), Binkley et al. ( 2015 ), Tantithamthavorn et al. ( 2018 ), Abdellatif et al. ( 2019 ) and Liu et al. ( 2020 ) experimented with LDA and LSI; Chen et al. ( 2014 ) experimented with LDA and Aspect and Sentiment Unification Model (ASUM); Chen et al. ( 2019 ) experimented with Labeled Latent Dirichlet Allocation (LLDA) and Label-to-Hierarchy Model (L2H); Rao and Kak ( 2011 ) experimented with LDA and MLE-LDA; and Hindle et al. ( 2016 ) experimented with LDA and LLDA. ASUM, LLDA, MLE-LDA and L2H are techniques based on LDA.

Fig. 2 Number of papers per topic modeling technique

The popularity of LDA in software engineering has also been discussed by others, e.g., Treude and Wagner ( 2019 ). LDA is a three-level hierarchical Bayesian model (Blei et al. 2003b ). LDA defines several hyperparameters, such as α (probability of topic z_i in document d_i), β (probability of word w_i in topic z_i) and k (number of topics to be generated) (Agrawal et al. 2018 ).
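The role of α, β and k can be illustrated with a minimal sketch of the generative process that LDA assumes, following the standard formulation (Blei et al. 2003b ): each topic draws a word distribution from a Dirichlet prior governed by β, each document draws a topic distribution from a Dirichlet prior governed by α, and each word is produced by first picking a topic and then a word from that topic. The sketch assumes symmetric priors. The vocabulary, the parameter values and the function names are hypothetical.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma samples."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, k, alpha, beta, n_docs, doc_len, seed=0):
    """Generate documents according to the LDA generative process."""
    rng = random.Random(seed)
    # One word distribution per topic, drawn from Dirichlet(beta).
    topic_word = [sample_dirichlet(beta, len(vocab), rng) for _ in range(k)]
    corpus = []
    for _ in range(n_docs):
        # One topic distribution per document, drawn from Dirichlet(alpha).
        doc_topic = sample_dirichlet(alpha, k, rng)
        words = []
        for _ in range(doc_len):
            topic = rng.choices(range(k), weights=doc_topic)[0]
            words.append(rng.choices(vocab, weights=topic_word[topic])[0])
        corpus.append(words)
    return corpus


vocab = ["error", "debug", "warn", "layout", "view", "css"]
corpus = generate_corpus(vocab, k=2, alpha=0.5, beta=0.1, n_docs=3, doc_len=8, seed=1)
```

Fitting LDA to real data runs this process in reverse: given only the words, it estimates the topic and word distributions that most plausibly generated them.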

Thirty-seven (out of 75) papers applied LDA with Gibbs Sampling (GS). Gibbs sampling is a Markov Chain Monte Carlo algorithm that samples from conditional distributions of a target distribution. Used with LDA, it is an approximate stochastic process for computing α and β (Griffiths and Steyvers 2004 ). According to experiments conducted by Layman et al. ( 2016 ), Gibbs sampling in LDA parameter estimation ( α and β ) resulted in lower perplexity than the Variational Expectation-Maximization (VEM) estimations. Perplexity is a standard measure of performance for statistical models of natural language, which indicates the uncertainty in predicting a single word. Therefore, lower values of perplexity mean better model performance (Griffiths and Steyvers 2004 ).
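As a worked illustration of why lower perplexity is better, the sketch below computes perplexity as the exponential of the negative average log-probability per word token, which matches the usual definition for held-out text (Blei et al. 2003b ). The per-token probabilities are hypothetical inputs. In practice they would come from a fitted topic model.

```python
import math


def perplexity(word_log_probs):
    """Perplexity from natural-log probabilities assigned to each word token."""
    if not word_log_probs:
        raise ValueError("at least one word token is required")
    return math.exp(-sum(word_log_probs) / len(word_log_probs))


# A model that assigns probability 0.25 to every held-out word is as uncertain
# as a uniform choice among four words; a model assigning 0.5 is less uncertain.
uncertain_model = perplexity([math.log(0.25)] * 10)
better_model = perplexity([math.log(0.5)] * 10)
```

The first model yields a perplexity of about 4 and the second of about 2, so the model that predicts each word with higher probability receives the lower value.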

Thirty papers applied modified or extended versions of LDA (“LDA-based” in Fig.  2 ). Table  4 shows a comparison between these LDA-based techniques. Eleven papers proposed a new extension of LDA to adapt LDA to software engineering problems (hence the same reference in the third and fourth column of Table  4 ). For example, the Multi-feature Topic Model (MTM) technique by Xia et al. ( 2017b ), which implements a supervised version of LDA to create a bug triaging approach. The other 19 papers applied existing modifications of LDA proposed by others (third column in Table  4 ). For example, Hu and Wong ( 2013 ) used the Citation Influence Topic Model (CITM), developed by Dietz et al. ( 2007 ), which models the influence of citations in a collection of publications.

The other topic modeling technique, LSI (Deerwester et al. 1990 ), was published in 1990, before LDA which was published in 2003. LSI is an information extraction technique that reduces the dimensionality of a term-document matrix using a reduction factor k (number of topics) (Deerwester et al. 1990 ). Compared to LSI, LDA follows a generative process that is statistically more rigorous (Blei et al. 2003b ; Griffiths and Steyvers 2004 ). From the 16 papers that used LSI, seven papers compared this technique to others:

One paper (Rosenberg and Moonen 2018 ) compared LSI with two other dimensionality reduction techniques: Principal Component Analysis (PCA) (Wold et al. 1987 ) and Non-Negative Matrix Factorization (NMF) (Lee and Seung 1999 ). The authors applied these models to automatically group log messages of continuous deployment runs that failed for the same reasons.

Four papers applied LDA and LSI at the same time to compare the performance of these models to Vector Space Model (VSM) (Salton et al. 1975 ), an algebraic model for information extraction. These studies supported documentation (De Lucia et al. 2014 ); bug handling (Thomas et al. 2013 ; Tantithamthavorn et al. 2018 ); and maintenance tasks (Abdellatif et al. 2019 ).

Regarding the other two papers, Binkley et al. ( 2015 ) compared LSI to Query likelihood LDA (QL-LDA) and other information extraction techniques to identify the best model for locating features in source code; and Liu et al. ( 2020 ) compared LSI and LDA to Generative Vector Space Model (GVSM), a deep learning technique, to select the best-performing model for documentation traceability to source code in multilingual projects.

5.2.2 Supported Tasks

As mentioned before, we aimed to understand why topic modeling was used in papers, e.g., if topic modeling was used to develop techniques to support specific software engineering tasks, or if it was used as a data analysis technique in exploratory studies to understand the content of large amounts of textual data. We found that the majority of papers aimed at supporting a particular task, but 21 papers (see Table  5 ) used topic modeling in empirical exploratory and descriptive studies as a data analysis technique.

We extracted the software engineering tasks described in each study (e.g., bug localization, bug assignment, bug triaging) and then grouped them into eight more generic tasks (e.g., bug handling) considering typical software development activities such as requirements, documentation and maintenance (Leach 2016 ). The specific tasks collected from papers are available online (see Footnote 1). Note that we kept “Bug handling” and “Refactoring” separate rather than merging them into maintenance because of the number of papers (bug handling) and the cross-cutting nature (refactoring) in these categories. Each paper was related to one of these tasks:

Architecting: tasks related to architecture decision making, such as selection of cloud or mash-up services (e.g., Belle et al. ( 2016 ));

Bug handling: bug-related tasks, such as assigning bugs to developers, prediction of defects, finding duplicate bugs, or characterizing bugs (e.g., Naguib et al. ( 2013 ));

Coding: tasks related to coding, e.g., detection of similar functionalities in code, reuse of code artifacts, prediction of developer behaviour (e.g., Damevski et al. ( 2018 ));

Documentation: support software documentation, e.g., by localizing features in documentation, automatic documentation generation (e.g., Souza et al. ( 2019 ));

Maintenance: software maintenance-related activities, such as checking consistency of versions of a software, investigating changes or use of a system (e.g., Silva et al. ( 2019 ));

Refactoring: support refactoring, such as identifying refactoring opportunities and removing bad smells from source code (e.g., Bavota et al. ( 2014b ));

Requirements: related to software requirements evolution or recommendation of new features (e.g., Galvis Carreno and Winbladh ( 2012 ));

Testing: related to identification or prioritization of test cases (e.g., Thomas et al. ( 2014 )).

Table  5 groups papers based on the topic modeling technique and the purpose. Few papers applied topic modeling to support Testing (three papers) and Refactoring (three papers). Bug handling is the most frequent supported task (33 papers). From the 21 exploratory studies, 13 modeled topics from developer communication to identify developers’ information needs: 12 analyzed posts on Stack Overflow, a Q&A website for developers (Chatterjee et al. 2019 ; Bajaj et al. 2014 ; Ye et al. 2017 ; Bagherzadeh and Khatchadourian 2019 ; Ahmed and Bagherzadeh 2018 ; Barua et al. 2014 ; Rosen and Shihab 2016 ; Zou et al. 2017 ; Chen et al. 2019 ; Han et al. 2020 ; Abdellatif et al. 2020 ; Haque and Ali Babar 2020 ) and one paper analyzed blog posts (Pagano and Maalej 2013 ). Regarding the other eight exploratory studies, three papers investigated web search queries to also identify developers’ information needs (Xia et al. 2017a ; Bajracharya and Lopes 2009 ; 2012 ); four papers investigated end user documentation to analyse users’ feedback on mobile apps (Tiarks and Maalej 2014 ; El Zarif et al. 2020 ; Noei et al. 2018 ; Hu et al. 2018 ); and one paper investigated historical “bug” reports of NASA systems to extract trends in testing and operational failures (Layman et al. 2016 ).

5.2.3 Types of Contribution

For each study, we identified what type of contribution it presents based on the study goal. We used three types of contributions (“Approach”, “Exploration” and “Comparison”, as described below) by analyzing the research questions and main results of each study. A study could contribute either an “Approach” or an “Exploration”, while “Comparison” is orthogonal, i.e., a study that presents a new approach could present a comparison of topic models as part of this contribution. Similarly, a comparison of topic models can also be part of an exploratory study.

Approach: a study develops an approach (e.g., technique, tool, or framework) to support software engineering activities based on or with the support of topic models. For example, Murali et al. ( 2017 ) developed a framework that applies LDA to Android API methods to discover types of API usage errors, while Le et al. ( 2017 ) developed a technique (APRILE+) for bug localization which combines LDA with a classifier and an artificial neural network.

Exploration: a study applies topic modeling as the technique to analyze textual data collected in an empirical study (in contrast to for example open coding). Studies that contributed an exploration did not propose an approach as described in the previous item, but focused on getting insights from data. For example, Barua et al. ( 2014 ) applied LDA to Stack Overflow posts to discover what software engineering topics were frequently discussed by developers; Noei et al. ( 2018 ) explored the evolution of mobile applications by applying LDA to app descriptions, release notes, and user reviews.

Comparison: the study (that can also contribute with an “Approach” or an “Exploration”) compares topic models to other approaches. For example, Xia et al. ( 2017b ) compared their bug triaging approach (based on the so-called Multi-feature Topic Model - MTM) with similar approaches that apply machine learning (Bugzie (Tamrawi et al. 2011 )) and SVM-LDA (combining a classifier with LDA (Somasundaram and Murphy 2012 )). On the other hand, De Lucia et al. ( 2014 ) compared LDA and LSI to define guidelines on how to build effective automatic text labeling techniques for program comprehension.

From the papers that contributed an approach , twenty-two combined a topic modeling technique with one or more other techniques applied for text mining:

Information extraction (e.g., VSM) (Nguyen et al. 2012 ; Zhang et al. 2018 ; Chen et al. 2020 ; Thomas et al. 2013 ; Fowkes et al. 2016 );

Classification (e.g., Support Vector Machine - SVM) (Hindle et al. 2013 ; Le et al. 2017 ; Liu et al. 2017 ; Demissie et al. 2020 ; Zhao et al. 2020 ; Shimagaki et al. 2018 ; Gopalakrishnan et al. 2017 ; Thomas et al. 2013 );

Clustering (e.g., K-means) (Jiang et al. 2019 ; Cao et al. 2017 ; Liu et al. 2017 ; Zhang et al. 2016 ; Altarawy et al. 2018 ; Demissie et al. 2020 ; Gorla et al. 2014 );

Structured prediction (e.g., Conditional Random Field - CRF) (Ahasanuzzaman et al. 2019 );

Artificial neural networks (e.g., Recurrent Neural Network - RNN) (Murali et al. 2017 ; Le et al. 2017 );

Evolutionary algorithms (e.g., Multi-Objective Evolutionary Algorithm - MOEA) (Blasco et al. 2020 ; Pérez et al. 2018 );

Web crawling (Nabli et al. 2018 ).

Pagano and Maalej ( 2013 ) was the only study that contributed an exploration and combined LDA with another text mining technique. To analyze how developer communities use blogs to share information, the authors applied LDA to extract keywords from blog posts and then analyzed related “streams of events” (commit messages and releases by time in relation to blog posts), which were created with sequential pattern mining.

Regarding comparisons, we found that (1) 13 out of the 63 papers that contribute an approach also include some form of comparison, and (2) ten out of the 48 papers that contribute an exploration also include some form of comparison. We discuss comparisons in more detail below in Section  6.1.2 .

5.3 RQ2: Topic Model Inputs

In this section we first discuss the type of data (Section  5.3.1 ). Then we discuss the actual textual documents used for topic modeling (Section  5.3.2 ). Finally, we describe which model parameters were used (Section  5.3.3 ) to configure models.

5.3.1 Types of Data

Types of data help us describe the textual software engineering content that has been analyzed with topic modeling. We identified 12 types of data in selected papers as shown in Table  6 . In some papers we identified two or three of these types of data; for example, the study of Tantithamthavorn et al. ( 2018 ) dealt with issue reports, log information and source code.

Source code (37 occurrences), issue/bug reports (22 occurrences) and developer communication (20 occurrences) were the most frequent types of data used. Seventeen papers used two to four types of data in their topic modeling technique; twelve of these papers used a combination of source code with another type of data. For example, Sun et al. ( 2015 ) generated topics from source code and developer communication to support software maintenance tasks, and in another study, Sun et al. ( 2017 ) used topics found in source code and commit messages to assign bug-fixing tasks to developers.

5.3.2 Documents

A document refers to a piece of textual data that can be longer or shorter, such as a requirements document or a single e-mail subject. Documents are concrete instances of the types of data discussed above. Figure  3 shows documents (per type of data) and how often we found them in papers. The most frequent documents are bug reports (12 occurrences), methods from source code (9 occurrences), Q&A posts (9 occurrences) and user reviews (8 occurrences).

Fig. 3 Documents (leaves in the figure) by type of data (nodes in the figure)

We also analyzed document length and found the following:

In general, papers described the length of documents in number of words, see Table  7 . Footnote 2 On the other hand, two papers (Moslehi et al. 2016 , 2020 ) described their documents’ length in minutes of screencast transcriptions (videos with one to ten minutes, no information about the size of transcripts). Sixteen papers mentioned the actual length of the documents, see Table  7 . Ten papers that described the actual document length did that when describing the data used for topic modeling; four papers discussed document length while describing results; and one mentioned document length as a metric for comparing different data sources;

Most papers (80 out of 111) did not mention document length and also did not acknowledge any limitations or the impact of document length on topics.

Fifteen papers did not mention the actual document length, but at some point acknowledged the influence of document length on topic modeling. For example, Abdellatif et al. ( 2019 ) mentioned that the documents in their data set were “not long”. Similarly, Yan et al. ( 2016b ) did not mention the length of the bug reports used but discussed the impact of the vocabulary size of their corpus on results. Moslehi et al. ( 2018 ) mentioned document length as a limitation and acknowledged that using LDA on short documents was a threat to construct validity. According to these authors, using techniques specific for short documents could have improved the outcomes of their topic modeling.

5.3.3 Model Parameters

Topic models can be configured with parameters that impact how topics are generated. For example, LDA has typically been used with symmetric Dirichlet priors over 𝜃 (document-topic distributions) and ϕ (topic-word distributions) with fixed values for α and β (Wallach et al. 2009 ). Wallach et al. ( 2009 ) explored the robustness of a topic model with asymmetric priors over 𝜃 (i.e., varying values for α ) and a symmetric prior (fixed value for β ) over ϕ . Their study found that such a topic model can capture more distinct and semantically-related topics, i.e., the words in clusters are more distinct. Therefore, we checked which parameters and values were used in papers. Overall, we found the following:

Eighteen of the 111 papers do not mention parameters (e.g., number of topics k , hyperparameters α and β ). Thirteen of these papers use LDA or an LDA-based technique, four papers use LSI, while Liu et al. ( 2020 ) use LDA and LSI.

The remaining 93 papers mention at least one parameter. The most frequent parameters discussed were k , α and β :

Fifty-eight papers mentioned actual values for k , α and β ;

Two papers mentioned actual values for α and β , but no values for k ;

Twenty-nine papers included actual values for k but not for α and β ;

Thirty-two (out of 58) papers mentioned other parameters in addition to k , α and β . For example, Chen et al. ( 2019 ) applied L2H (in comparison to LLDA), which uses the hyperparameters γ 1 and γ 2 ;

One paper (Rosenberg and Moonen 2018 ) that applied LSI, mentioned the parameter “similarity threshold” rather than k , α and β .

We then had a closer look at the 60 papers that mentioned actual values for hyperparameters α and β :

α based on k : The most frequent setting (29 papers) was α = 50/ k and β = 0.01 (i.e., α depended on the number of topics, a strategy suggested by Steyvers and Griffiths ( 2010 ) and Wallach et al. ( 2009 )). These values are a default setting in Gibbs Sampling implementations for LDA such as Mallet. Footnote 3

Fixed α and β : Five papers fixed 0.01 for both hyperparameters, as suggested by Hoffman et al. ( 2010 ). Another eight papers fixed 0.1 for both hyperparameters, a default setting in Stanford Topic Modeling Toolbox (TMT); Footnote 4 and three other papers fixed α = 0.1 and β = 1 (these three studies applied RTM).

Varying α or β : Four papers tested different values for α , where two of these papers also tested different values for β ; and one paper varied β but fixed a value for α .

Optimized parameters : Four papers obtained optimized values for hyperparameters (Sun et al. 2015 ; Catolino et al. 2019 ; Yang et al. 2017 ; Zhang et al. 2018 ). These papers applied LDA-GA (as proposed by Panichella et al. ( 2013 )) which, based on genetic algorithms, finds the best values for LDA hyperparameters. With regard to the actual values chosen for optimized hyperparameters, Catolino et al. ( 2019 ) did not mention the values for hyperparameters; Sun et al. ( 2015 ) and Yang et al. ( 2017 ) mentioned only the values used for k ; and Zhang et al. ( 2018 ) described the values for k , α and β .
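The fixed settings reported above can be summarized in a small lookup, which makes explicit that only the first setting depends on the number of topics k. This is an illustrative sketch. The setting labels and the function name `priors` are hypothetical and do not correspond to options of any tool.

```python
def priors(setting, k):
    """Return (alpha, beta) for one of the settings reported in the reviewed papers."""
    if k <= 0:
        raise ValueError("k must be a positive number of topics")
    settings = {
        "alpha_from_k": (50.0 / k, 0.01),  # alpha = 50/k, beta = 0.01
        "fixed_0.01": (0.01, 0.01),        # both hyperparameters fixed at 0.01
        "fixed_0.1": (0.1, 0.1),           # both hyperparameters fixed at 0.1
        "rtm": (0.1, 1.0),                 # alpha = 0.1, beta = 1 (studies applying RTM)
    }
    return settings[setting]


alpha_20, beta_20 = priors("alpha_from_k", 20)
alpha_100, beta_100 = priors("alpha_from_k", 100)
```

With α = 50/ k , a model with 20 topics uses α = 2.5 while a model with 100 topics uses α = 0.5, so α shrinks as more topics are requested.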

Regarding the values for k we observed the following:

The 90 papers that mentioned values for k modeled three (Cao et al. 2017 ) to 500 (Li et al. 2018 ; Lukins et al. 2010 ; Chen et al. 2017 ) topics;

Twenty-four (out of 90) papers mentioned that a range of values for k was tested in order to check the performance of the technique (e.g., Xia et al. ( 2017b )) or as a strategy to select the best number of topics (e.g., Layman et al. ( 2016 ));

Although the remaining 66 (out of 90) papers mentioned a single value used for k, most of them acknowledged that they had tried several numbers of topics or used the number of topics suggested by other studies.

As can be seen in Table 7, there is no common trend in the values chosen for the hyperparameters or k depending on the document or document length.

5.4 RQ3: Pre-processing Steps

Thirteen of the papers did not mention what pre-processing steps were applied to the data before topic modeling. Seven papers only described how the data analyzed were selected, but not how they were pre-processed. Table  8 shows the pre-processing steps found in the remaining 91 papers. Each of these papers mentioned at least one of these steps.

Removing noisy content (76 occurrences), Stemming terms (61 occurrences) and Splitting terms (33 occurrences) were the most used pre-processing steps. The least frequent pre-processing step (Resolving negations) was found only in the studies of Noei et al. ( 2019 ) and Noei et al. ( 2018 ). Resolving synonyms and Expanding contractions were also less frequent, with three occurrences each.

Table 9 shows the types of noise removal in papers and their frequency. Most of the papers that described pre-processing steps removed stop words (76 occurrences). Stop words are the most common words in a language, such as “a/an” and “the” in English. Removing stop words allows topic modeling techniques to focus on more meaningful words in the corpus (Miner et al. 2012). Eight papers mentioned the stop words list used: Layman et al. (2016) and Pettinato et al. (2019) used the SMART stop words list (Footnote 5); Martin et al. (2015) and Hindle et al. (2013) used the Natural Language Toolkit English stop words list (Footnote 6); Bagherzadeh and Khatchadourian (2019), Ahmed and Bagherzadeh (2018) and Yan et al. (2016b) used the Mallet stop words list (Footnote 7); and Mezouar et al. (2018) used the Moby stop words list (Footnote 8).

As can be seen in Table 9, some papers removed words based on the frequency of their occurrence (most or least frequent terms) or length (words shorter than four, three or two letters or long terms). Other papers removed long paragraphs. For example, Henß et al. (2012) removed paragraphs longer than 800 characters because most paragraphs in their data set were shorter than that. We also found two papers that removed short documents: Gorla et al. (2014) removed documents with fewer than ten words, and Palomba et al. (2017) removed documents with fewer than three words. The concept of non-informative content depends on the context of each paper. In general, it refers to any data considered not relevant for the objective of the study. For example, Choetkiertikul et al. (2017), who aimed at predicting bugs in issue reports, removed issues that took too much time to be resolved. Noei et al. (2019) and Fu et al. (2015) removed content (end user reviews and commit messages) that did not describe feedback or cause of change.
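The frequency-based and length-based filters described above can be sketched as follows. The thresholds, the function name `filter_corpus` and the toy documents are assumptions made for illustration; the reviewed papers chose their own cut-offs.

```python
from collections import Counter


def filter_corpus(docs, min_word_len=3, max_doc_freq=0.8, min_doc_words=3):
    """Filter a tokenized corpus (a list of token lists).

    Drops tokens shorter than min_word_len, tokens that occur in more than
    max_doc_freq (a fraction) of all documents, and finally documents that
    are left with fewer than min_doc_words tokens. All thresholds are
    illustrative defaults.
    """
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    kept = []
    for doc in docs:
        tokens = [
            t for t in doc
            if len(t) >= min_word_len and doc_freq[t] / n_docs <= max_doc_freq
        ]
        if len(tokens) >= min_doc_words:
            kept.append(tokens)
    return kept


# Hypothetical, already tokenized user reviews.
docs = [
    ["app", "crashes", "on", "startup", "screen"],
    ["app", "login", "button", "broken"],
    ["app", "ok"],
]
print(filter_corpus(docs))
```

In this toy corpus, "app" is removed because it occurs in every document, "on" and "ok" are removed for being too short, and the third review is dropped because no tokens remain.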

5.5 RQ4: Topic Naming

Topic naming is about assigning labels (names) to topics (word clusters) to give the clusters a human-understandable meaning. Seventy-five papers (out of 111) did not mention whether or how topics were named. These papers only used the word clusters for analysis, but did not require a name. For example, Xia et al. ( 2017a ) and Canfora et al. ( 2014 ) did not name topics, but mapped the word clusters to the documents (search queries and source code comments) used as input for topic modeling. These papers used the probability of a document to belong to a topic ( 𝜃 ) to associate a document to the topic with the highest probability.
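Associating a document with its most probable topic, as described above, amounts to taking the arg max over the document's row of 𝜃. The sketch below assumes 𝜃 is available as plain Python lists; the document identifiers and probabilities are made up for illustration.

```python
def dominant_topic(theta_row):
    """Return (topic_index, probability) of the most probable topic.

    theta_row is one document's topic distribution, i.e. one row of theta.
    """
    if not theta_row:
        raise ValueError("theta_row must not be empty")
    best = max(range(len(theta_row)), key=lambda t: theta_row[t])
    return best, theta_row[best]


# Hypothetical document-topic distributions for k = 3 topics.
theta = {
    "query_1": [0.10, 0.70, 0.20],
    "query_2": [0.55, 0.05, 0.40],
}
assignments = {doc: dominant_topic(row)[0] for doc, row in theta.items()}
print(assignments)
```

Note that this hard assignment discards the rest of the distribution: "query_2" is assigned to topic 0 even though topic 2 is almost as probable.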

From the 36 papers (out of 111) that mentioned topic naming (see Table  10 ), we identified three ways of how they named topics:

Automated: Assigning names to word clusters without human intervention;

Manual: Manually checking the meaning and the combination of words in a cluster to “deduce” a name, sometimes validated with expert judgment;

Manual & Automated: Mix of manual and automated; e.g., topics are manually labeled for one set of clusters to then train a classifier for naming another set of clusters.

Most of the papers (30 out of 36) assigned one name to one topic. However, we identified six papers that used one name for multiple topics (Hindle et al. 2013; Pagano and Maalej 2013; Bajracharya and Lopes 2012; Rosen and Shihab 2016) or labeled a topic with multiple names (Zou et al. 2017; Gao et al. 2018). Two of the papers (Hindle et al. 2013; Bajracharya and Lopes 2012) that assigned one name to multiple topics used predefined labels, and in the other two papers (Pagano and Maalej 2013; Rosen and Shihab 2016) the authors interpreted words in the clusters to deduce names.

Regarding the papers that assigned multiple names to a topic, Zou et al. (2017) assigned no, one or more names, depending on how many words in the predefined word list matched words in clusters. Gao et al. (2018) used an automated approach to label topics with the three most relevant phrases and sentences from the end user reviews inputted to their topic model. The relevance of phrases and sentences was obtained with the metrics Semantic and Sentiment scores proposed by these authors.

6 Discussion

6.1 RQ1: Topic Modeling Techniques

6.1.1 Summary of Findings

LDA is the most frequently used topic model. Almost all papers (95 out of 111) applied LDA or an LDA-based technique, while nine papers applied LSI to identify topics and seven papers used LDA and LSI. Regarding the papers that used LDA-based techniques, eleven (out of 30) proposed their own LDA-based technique (Fu et al. 2015; Nguyen et al. 2011; Liu et al. 2017; Cao et al. 2017; Panichella et al. 2013; Yan et al. 2016a; Xia et al. 2017b; Nguyen et al. 2012; Damevski et al. 2018; Gao et al. 2018; Rao and Kak 2011). This may indicate that the default LDA implementation is not adequate to support specific software engineering tasks or to extract meaningful topics from all types of data. We discuss topic modeling techniques and their inputs further in Section 6.2.2. Furthermore, we found that topic modeling is used to develop tools and methods to support software engineers and concrete tasks (the most frequently supported task we found was bug handling), but also as a data analysis technique for textual data to explore empirical questions (see for example the “oldest” paper in our sample, published in 2009 (Bajracharya and Lopes 2009)).

One aspect that we did not specifically address in this review, but which impacts the applicability of topic models, is their computational overhead. Computational overhead refers to the processing time and computational resources (e.g., memory, CPU) required for topic modeling. As discussed by others, topic modeling can be computationally intensive (Hoffman et al. 2010; Treude and Wagner 2019; Agrawal et al. 2018). However, we found that only a few papers (seven out of 111) mentioned computational overhead at all. Of these seven papers, five mentioned processing time (Bavota et al. 2014b; Zhao et al. 2020; Luo et al. 2016; Moslehi et al. 2016; Chen et al. 2020), one paper mentioned computational requirements and some processing times (e.g., processor, data pre-processing time, LDA processing time and clustering processing time), and one paper only mentioned that their technique was processed in “few seconds” (Murali et al. 2017). Hence, based on the reviewed studies we cannot provide broader insights into the practical applicability and potential constraints of topic modeling based on the computational overhead.

6.1.2 Comparative Studies

As mentioned in Sections 5.2.1 and 5.2.3, we identified studies that used more than one topic modeling technique and compared their performance. In detail, we found studies that (1) compared topic modeling techniques to information extraction techniques, such as Vector Space Model (VSM), an algebraic model (Salton et al. 1975) (see Table 11), (2) proposed an approach that uses a topic modeling technique and compared it to other approaches (which may or may not use topic models) with similar goals (see Table 12), and (3) compared the performance of different settings for a topic modeling technique or a newly proposed approach that utilizes topic models (see Table 13). The column “Metric” of Tables 11, 12 and 13 shows the metrics used in the comparisons to decide which techniques performed “better” (based on the metrics’ interpretation). Metrics in bold were proposed for or adapted to a specific context (e.g., SCORE and Effort reduction), while the other metrics are standard NLP metrics (e.g., Precision, Recall and Perplexity). Details about the metrics used to compare the techniques are provided in Appendix A.2 - Metrics Used in Comparative Studies.

As shown in Table  11 , ten papers compared topic modeling techniques to information extraction techniques. For example, Rosenberg and Moonen ( 2018 ) compared LSI with two other dimensionality reduction techniques (PCA and NMF) to group log messages of failing continuous deployment runs. Nine out of these ten papers presented explorations, i.e., studies experimented with different models to discuss their application to specific software engineering tasks, such as bug handling, software documentation and maintenance. Thomas et al. ( 2013 ) on the other hand experimented with multiple models to propose a framework for bug localization in source code that applies the best performing model.

Four papers in Table 11 (De Lucia et al. 2014; Tantithamthavorn et al. 2018; Abdellatif et al. 2019; Thomas et al. 2013) compared the performance of LDA, LSI and VSM with source code and issue/bug reports. Except for De Lucia et al. (2014), these studies applied Top-k accuracy (see Appendix A.2 - Metrics Used in Comparative Studies) to measure the performance of models, and the best performing model was VSM. Tantithamthavorn et al. (2018) found that VSM achieves both the best Top-k performance and the least required effort for method-level bug localization. Additionally, according to De Lucia et al. (2014), VSM possibly performed better than LSI and LDA due to the nature of the corpus used in their study: LDA and LSI are ideal for heterogeneous collections of documents (e.g., user manuals from different systems), but in the study by De Lucia et al. (2014) each corpus was a collection of code classes from a single software system.
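To make the Top-k accuracy metric mentioned above more concrete, the sketch below uses the definition common in bug localization: the fraction of queries (e.g., bug reports) for which at least one relevant item appears among the first k ranked results. The exact definition used by each study is given in Appendix A.2, so this should be read as an assumption; the file names and rankings are invented.

```python
def top_k_accuracy(rankings, relevant, k):
    """Fraction of queries with at least one relevant item in the top k.

    rankings: dict mapping a query id to a ranked list of candidate items.
    relevant: dict mapping a query id to the set of truly relevant items.
    """
    if not rankings:
        return 0.0
    hits = sum(
        1 for query, ranked in rankings.items()
        if set(ranked[:k]) & relevant.get(query, set())
    )
    return hits / len(rankings)


# Hypothetical rankings of source files for two bug reports.
rankings = {
    "bug_1": ["A.java", "B.java", "C.java"],
    "bug_2": ["D.java", "E.java", "F.java"],
}
relevant = {"bug_1": {"B.java"}, "bug_2": {"F.java"}}

for k in (1, 2, 3):
    print(k, top_k_accuracy(rankings, relevant, k))
```

Because the score depends on k and on what counts as a relevant item, Top-k values from different studies are only comparable when both are defined in the same way.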

Ten studies proposed an approach that uses a topic modeling technique and compared it to similar approaches (shown in Table  12 ). In column “Approaches compared” of Table  12 , the approach in bold is the one proposed by the study (e.g., Cao et al. 2017 ) or the topic modeling technique used in their approach (e.g., Thomas et al. 2014 ). All newly proposed approaches were the best performing ones according to the metrics used.

In addition to the papers mentioned in Tables 11 and 12, four papers compared the performance of different settings for a topic modeling technique or tested which topic modeling technique works best in their newly proposed approach (see Table 13). Biggers et al. (2014) offered specific recommendations for configuring LDA when localizing features in Java source code, and observed that certain configurations outperform others. For example, they found that commonly used heuristics for selecting LDA hyperparameter values (β = 0.01 or β = 0.1) in source code topic modeling are not optimal (similar to what has been found by others, see Section 3.2). The other three papers (Chen et al. 2014; Fowkes et al. 2016; Poshyvanyk et al. 2012) developed approaches which were tested with different settings (e.g., the approach applying LDA or ASUM (Chen et al. 2014)).

Regarding the datasets used by comparative studies, only Rao and Kak (2011) used a benchmarking dataset (iBUGS). Most of the comparative studies (13 out of 24) used source code or issue/bug reports from open source software, which are subject to evolution. The advantage of using benchmarking datasets rather than “living” datasets (e.g., an open source Java system) is that their data will be static and the same across studies. Additionally, data in benchmarking datasets are usually curated. This means that the results of replicating studies can be compared to the original study when both used the same benchmarking dataset.

Finally, we highlight that each of the above mentioned comparisons has a specific context. This means that, for example, the type of data analyzed (e.g., Java classes), the parameter setting (e.g., k = 50), the goal of the comparison (e.g., to select the best model for bug localization or for tracing documentation in source code) and pre-processing (e.g., stemming and stop word removal) were different. Therefore, it is not possible to “synthesize” the results from the comparisons across studies by aggregating the different comparisons in different papers, even for studies that appear to have similar goals or use the same topic modeling techniques, such as comparing the same models with similar types of data (such as Tantithamthavorn et al. 2018 and Abdellatif et al. 2019 ).

6.2 RQ2: Inputs to Topic Models

6.2.1 Summary of Findings

Source code, developer communication and issue/bug reports were the most frequent types of data used for topic modeling in the reviewed papers. Consequently, most of the documents referred to individual or groups of functions or methods, individual Q&A posts, or individual bug reports; another frequent document was an individual user review (more discussions are in Section  6.2.3 ). We also found that few papers (16 out of 111) mentioned the actual length of documents used for topic modeling (we discuss this more in Section  6.2.2 ).

Regarding modeling parameters, most of the papers (93 out of 111) explicitly mentioned the configuration of at least one parameter, e.g., k, α or β for LDA. We observed that the setting α = 50/k and β = 0.01 (asymmetric α and symmetric β) as suggested by Steyvers and Griffiths (2010) and Wallach et al. (2009) was frequently used (28 out of 93 papers). Additionally, papers that applied LDA mostly used the default parameters of the tools used to implement LDA (e.g., Mallet (Footnote 3) with α = 50/k and β = 0.01 as default). This finding is similar to what has been reported by others; e.g., according to another review by Agrawal et al. (2018), LDA is frequently applied “as is out-of-the-box” or with little tuning. This means that studies may rely on the default settings of the tools used with their topic modeling technique, such as Mallet and TMT, rather than try to optimize parameters.

6.2.2 Documents and Parameters for Topic Models

Short texts: According to Lin et al. (2014), topic models such as LDA have been widely adopted and successfully used with traditional media like edited magazine articles. However, applying LDA to informal communication text such as tweets, comments on blog posts, instant messaging and Q&A posts may be less successful. Such user-generated content is characterized by very short document length, a large vocabulary and a potentially broad range of topics. As a consequence, there are not enough words in a document to create meaningful clusters, compromising the performance of the topic modeling. This means that probabilistic topic models such as LDA perform sub-optimally when applied “as is” with short documents even when hyperparameters (α and β in LDA) are optimized (Lin et al. 2014). In our sample there were only two papers that mentioned the use of an LDA-based technique specifically for short documents (Hu et al. 2019; Hu et al. 2018). Hu et al. (2019) and Hu et al. (2018) applied Twitter-LDA with end user reviews. Furthermore, Moslehi et al. (2018) used a weighting algorithm in documents to generate topics with more relevant words; they also acknowledged that the use of a short text technique could have improved their topic model.

As shown in Table  7 , few papers mentioned the actual length of documents. Considering a single document from a corpus, we observed that most papers potentially used short texts (all documents found in papers are shown in Fig.  3 ). For example, papers used an individual search query (Xia et al. 2017a ), an individual Q&A post (Barua et al. 2014 ), an individual user review (Nayebi et al. 2018 ), or an individual commit message (Canfora et al. 2014 ) as a document. Among the papers that mentioned document length, the shortest documents were an individual commit message (9 to 20 words) (Canfora et al. 2014 ) and an individual method (14 words) (Tantithamthavorn et al. 2018 ). Both studies applied LDA.

Two approaches to improve the performance of LDA when analyzing short documents are pooling and contextualization (Lin et al. 2014 ). Pooling refers to aggregating similar (e.g., semantically or temporally) documents into a single document (Mehrotra et al. 2013 ). For example, among the papers analysed, Pettinato et al. ( 2019 ) used temporal pooling and combined short log messages into a single document based on a temporal order. Contextualization refers to creating subsets of documents according to a type of context; considering tweets as documents, the type of context can refer to time, user and hashtags associated with tweets (Tang et al. 2013 ). For example, Weng et al. ( 2010 ) combined all the individual tweets of an author into one pseudo-document (rather than treating each tweet as a document). Therefore, with the contextualization approach, the topic model uses word co-occurrences at a context level instead of at the document level to discover topics.
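Both strategies described above boil down to grouping short texts by some key before topic modeling. The sketch below shows this grouping step only; the message fields (`author`, `hour`, `text`), the function name `pool_documents` and the sample messages are assumptions made for illustration and do not reproduce the procedure of any cited study.

```python
from collections import defaultdict


def pool_documents(messages, key):
    """Merge short texts that share the same key into one pseudo-document.

    messages: list of dicts that contain at least a "text" field.
    key: function that maps a message to its group (e.g., author or time slot).
    """
    pooled = defaultdict(list)
    for msg in messages:
        pooled[key(msg)].append(msg["text"])
    # Concatenate texts in their original order within each group.
    return {group: " ".join(texts) for group, texts in pooled.items()}


# Hypothetical short messages.
messages = [
    {"author": "alice", "hour": 9, "text": "build failed"},
    {"author": "bob", "hour": 9, "text": "retrying deploy"},
    {"author": "alice", "hour": 10, "text": "tests pass now"},
]

by_author = pool_documents(messages, key=lambda m: m["author"])
by_hour = pool_documents(messages, key=lambda m: m["hour"])
print(by_author)
print(by_hour)
```

The choice of key determines which word co-occurrences the topic model can observe: grouping by author mixes everything one person wrote, while grouping by time slot mixes everything written in the same period.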

Hyperparameters: Table 14 shows the hyperparameter settings and types of data of the papers that mentioned the value of at least one model parameter. In Table 14 we also highlight the topic modeling techniques used. Note that some topic modeling techniques (e.g., RTM) can receive more parameters than the ones mentioned in Table 14 (e.g., number of documents, similarity thresholds); all parameters mentioned in papers are available online in the raw data of our study (Footnote 1). When comparing hyperparameter settings, topic modeling techniques and types of data, we observed the following:

Papers that used LDA-GA, an LDA-based technique that optimizes hyperparameters with genetic algorithms, applied it to data from developer documentation or source code;

LDA was used with all three types of hyperparameter settings across studies. The most common setting was α based on k for developer communication and source code;

Most of the LDA-based techniques applied fixed values for α and β .

Most of the papers that applied only LSI as the topic modeling technique did not mention hyperparameters. As LSI is a simpler model than LDA, it generally requires the number of topics k. For example, a paper that applied LSI to source code mentioned α and k (Poshyvanyk et al. 2012).

Number of topics: By relating the type of data to the number of topics, we aimed at finding whether the choice of the number of topics is related to the data used in the topic modeling techniques (see also Table 7). However, the number of topics used and the data in the studies are rather diverse. Therefore, the potential for synthesizing practices and offering insights from previous studies on how to choose the number of topics is rather limited.

From the 90 papers that mentioned number of topics ( k ), we found that 66 papers selected a specific number of topics (e.g., based on previous works with similar data or addressing the same task), while 24 papers used several numbers of topics (e.g., Yan et al. ( 2016b ) used 10 to 120 topics in steps of 10). To provide an example of how the number of topics differed even when the same type of data was analyzed with the same topic modeling technique, we looked at studies that applied LDA in textual data from developer communication (mostly Q&A posts) to propose an approach to support documentation. For these papers we found one paper that did not mention k (Henß et al. 2012 ), one paper that modeled different numbers of topics ( k = 10,20,30) (Asuncion et al. 2010 ), one paper that modeled k = 15 (Souza et al. 2019 ) and another paper that modeled k = 40 (Wang et al. 2015 ). This illustrates that there is no common or recommended practice that can be derived from the papers.

Some papers mentioned that they tested several numbers of topics before selecting the most appropriate value for k (with regard to the studies’ goals) but did not mention the range of values tested. Among the papers that mentioned such a range, we identified four studies (Nayebi et al. 2018; Chen et al. 2014; Layman et al. 2016; Nabli et al. 2018) that tested several values for k and used the perplexity of models (see details in Appendix A.2 - Metrics Used in Comparative Studies) to evaluate which value of k generated the best performing model; three studies (Zhao et al. 2020; Han et al. 2020; El Zarif et al. 2020) also selected the number of topics after testing several values for k; however, they used topic coherence (Röder et al. 2015) to evaluate models. One paper (Haque and Ali Babar 2020) used both perplexity and topic coherence to select a value for k. Metrics of topic coherence score the probability of a pair of words from the resulting word clusters being found together in (a) external data sources (e.g., Wikipedia pages) or (b) the documents used by the topic model that generated those word clusters (Röder et al. 2015).
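As an illustration of variant (b) above, the following sketch scores a topic by how often pairs of its top words co-occur in the modeled documents. It follows the UMass-style document co-occurrence formulation, which is only one of several coherence measures; the reviewed studies may have used other variants, and the toy documents and word lists are invented.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass-style coherence of one topic.

    top_words: the topic's top words, ordered from most to least probable.
    documents: tokenized documents (lists of tokens).
    For every pair (earlier, later) of top words, adds
    log((D(earlier, later) + 1) / D(earlier)), where D counts documents
    that contain all the given words. Higher (less negative) is better.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(1 for d in doc_sets if all(w in d for w in words))

    score = 0.0
    for earlier, later in combinations(top_words, 2):
        denominator = doc_freq(earlier)
        if denominator == 0:
            continue  # the word never occurs in the corpus; skip this pair
        score += math.log((doc_freq(earlier, later) + 1) / denominator)
    return score


# Hypothetical tokenized documents and two candidate word clusters.
documents = [
    ["bug", "crash", "fix"],
    ["bug", "crash"],
    ["ui", "button"],
    ["ui", "button", "color"],
]
coherent = umass_coherence(["bug", "crash"], documents)
mixed = umass_coherence(["bug", "button"], documents)
print(coherent, mixed)
```

To select k with such a score, one would train a model for each candidate k, average the coherence over its topics, and compare the averages; that training step is omitted here.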

6.2.3 Supported Tasks, Types of Data and Types of Contribution

We looked into the relationship between the tasks supported by papers, the type of data used and the types of contributions (see Table  15 ). We observed the following:

Source code was a frequent type of data in papers; consequently it appeared for almost all supported tasks, except for exploratory studies;

Considering exploratory studies, most papers used developer communication (13 out of 21), followed by search queries and end user communication (three papers each);

Papers that supported bug handling mostly used issue/bug reports, source code and end user communication;

Log information was used by papers that supported maintenance, bug handling, and coding;

Considering the papers that supported documentation, three used transcript texts from speech;

From the four papers related to the type of data developer documentation, two supported architecting tasks and the other two, documentation tasks.

Regarding the type of data, URLs and transcripts were only used in studies that contributed an approach.

We found that most of the exploratory studies used data that is less structured. For example, developer communication, such as Q&A posts and conversation threads, generally does not follow a standardized template. On the other hand, issue reports are typically submitted through forms, which enforce a certain structure.

6.3 RQ3: Data Pre-processing

6.3.1 Summary of Findings

Most of the papers (91 out of 111) pre-processed the textual data before topic modeling. Removing noisy content was the most frequent pre-processing step (as typical for natural language processing), followed by stemming and splitting words. Miner et al. ( 2012 ) consider tokenizing as one of the basic data pre-processing steps in text mining. However, in comparison to other basic pre-processing steps such as stemming, splitting words and removing noise, tokenizing was not frequently found in papers (it was at least not mentioned in papers).

Eight papers (Henß et al. 2012 ; Xia et al. 2017b ; Ahasanuzzaman et al. 2019 ; Abdellatif et al. 2019 ; Lukins et al. 2010 ; Tantithamthavorn et al. 2018 ; Poshyvanyk et al. 2012 ; Binkley et al. 2015 ) tested how pre-processing steps affected the performance of topic modeling or topic model-based approaches. For example, Henß et al. ( 2012 ) tested several pre-processing steps (e.g., removing stop words, long paragraphs and punctuation) in e-mail conversations analyzed with LDA. They found that removing such content increased LDA’s capability to grasp the actual semantics of software mailing lists. Ahasanuzzaman et al. ( 2019 ) proposed an approach which applies LDA and Conditional Random Field (CRF) to localize concerns in Stack Overflow posts. The authors did not incorporate stemming and stop words removal in their approach because in preliminary tests these pre-processing steps decreased the performance of the approach.

6.3.2 Pre-processing Different Types of Data

Table  16 shows how different types of data were pre-processed. We observed that stemming, removing noise, lowercasing, and splitting words were commonly used for all types of data. Regarding the differences, we observed the following:

For developer communication there were specific types of noisy content that were removed: URLs, HTML tags and code snippets. This might have happened because most of the papers used Q&A posts as documents, which frequently contain hyperlinks and code examples;

Removing non-informative content was frequently applied to end user communication and end user documentation;

Expanding contracted terms (e.g., “didn’t” to “did not”) was applied to end user communication and issue/bug reports;

Removing empty documents and eliminating extra white spaces were applied only in end user communication. Empty documents occurred in this type of data because after the removal of stop words no content was left (Chen et al. 2014 );

For source code there was a specific type of noise to be removed: programming-language-specific keywords (e.g., “public”, “class”, “extends”, “if”, and “while”).

Table 16 shows that splitting words, stop words removal and stemming were frequently applied to source code, and most of these studies (15) applied these three steps at the same time. Studies that performed these pre-processing steps on source code mostly used methods, classes, or comments in classes/methods as documents. For example, Silva et al. (2016), who applied LDA, performed these three pre-processing steps on classes from two open source systems using TopicXP (Savage et al. 2010). TopicXP is an Eclipse plug-in that extracts source code, pre-processes it and executes LDA. This plug-in implements splitting words, stop words removal and stemming.

Splitting words was the most frequent pre-processing step in source code. Studies used this step to separate camel case identifiers of methods and classes (e.g., the class constructor InvalidRequestTest produces the terms “invalid”, “request” and “test”). For example, Tantithamthavorn et al. (2018) compared LDA, LSI and VSM, testing different combinations of pre-processing steps applied to the method identifiers input to these techniques. The best performing approach was VSM with splitting words, stop words removal and stemming.
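The splitting step can be sketched with a regular expression. The pattern below is one possible choice, not the one used by the cited studies; it also splits on underscores and digits, which is an assumption beyond the camel case example in the text.

```python
import re

# Matches, in order of preference: an acronym followed by a capitalized word
# (e.g. "HTTP" in "HTTPResponse"), a word with an optional leading capital,
# a trailing run of capitals, or a run of digits.
_TOKEN = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")


def split_identifier(identifier):
    """Split a source code identifier into lowercase terms."""
    return [part.lower() for part in _TOKEN.findall(identifier)]


print(split_identifier("InvalidRequestTest"))
print(split_identifier("parseHTTPResponse"))
print(split_identifier("max_retry_count"))
```

Applied to the example from the text, `InvalidRequestTest` yields the terms "invalid", "request" and "test".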

Removing stop words in source code refers to the exclusion of the most common words in a language (e.g., “a/an” and “the” in English), as in studies that used other types of data. Removing stop words in source code is also different from removing programming language keywords, and studies mentioned these as separate steps. Lukins et al. (2010), for example, tested how removing stop words from their documents (comments and identifiers of methods) affected the topics generated by their LDA-based approach. They found that this step did not improve the results substantially.
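The distinction between the two steps can be shown with two separate exclusion lists. Both lists below are deliberately tiny and illustrative: the keywords are the examples quoted earlier in this section, and the stop words are a handful of common English words rather than any of the published lists (SMART, NLTK, Mallet, Moby).

```python
# Illustrative lists only; real studies used full published lists.
STOP_WORDS = frozenset({"a", "an", "the", "of", "to", "is", "in"})
JAVA_KEYWORDS = frozenset({"public", "class", "extends", "if", "while"})


def remove_terms(tokens, excluded):
    """Return tokens whose lowercase form is not in the excluded set."""
    return [t for t in tokens if t.lower() not in excluded]


# Hypothetical tokens taken from a class declaration and its comment.
tokens = ["public", "class", "Parser", "extends", "the", "Base", "Reader"]

without_stop_words = remove_terms(tokens, STOP_WORDS)
without_both = remove_terms(without_stop_words, JAVA_KEYWORDS)
print(without_stop_words)
print(without_both)
```

Keeping the two lists separate makes it possible to report, and to evaluate, each step on its own, as Lukins et al. (2010) did for stop words.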

As mentioned in Section 5.4, stemming is the process of normalizing words into their single forms by identifying and removing prefixes, suffixes and pluralisation (e.g., “development”, “developer”, “developing” become “develop”). Regarding stemming in source code, papers normalized identifiers of classes and methods, comments related to classes and methods, test cases or a source code file. Three papers tested the effect of this pre-processing step on the performance of their techniques (Tantithamthavorn et al. 2018; Poshyvanyk et al. 2012; Binkley et al. 2015), and one of these papers also tested removing stop words and splitting words (Tantithamthavorn et al. 2018). Poshyvanyk et al. (2012) tested the effect of stemming classes on the performance of their LSI-based approach. The authors concluded that stemming can positively impact feature localization by producing topics (“concept lattices” in their study) that effectively organize the results of searches in source code. Binkley et al. (2015) compared the performance of LSI, QL-LDA and other techniques. They also tested the effects of stemming (with two different stemmers: Porter (Footnote 9) and Krovetz (Footnote 10)) and non-stemming on methods from five open source systems. These authors found that they obtained better performance in terms of the models’ Mean Reciprocal Rank (MRR, details in Appendix A.2 - Metrics Used in Comparative Studies) with non-stemming.
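The Mean Reciprocal Rank used in the last comparison can be sketched as follows, using its standard information retrieval definition: the average, over all queries, of 1 divided by the rank of the first relevant result (0 if none is found). The exact definition applied in the reviewed studies is given in Appendix A.2; the queries and method names below are invented.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Average of 1/rank of the first relevant item over all queries.

    rankings: dict mapping a query id to a ranked list of candidate items.
    relevant: dict mapping a query id to the set of truly relevant items.
    A query without any relevant item in its ranking contributes 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for query, ranked in rankings.items():
        targets = relevant.get(query, set())
        for position, item in enumerate(ranked, start=1):
            if item in targets:
                total += 1.0 / position
                break
    return total / len(rankings)


# Hypothetical rankings of methods for three feature location queries.
rankings = {
    "q1": ["save", "load", "close"],   # first relevant item at rank 1
    "q2": ["open", "parse", "print"],  # first relevant item at rank 2
    "q3": ["init", "reset", "flush"],  # no relevant item retrieved
}
relevant = {"q1": {"save"}, "q2": {"parse"}, "q3": {"render"}}
print(mean_reciprocal_rank(rankings, relevant))
```

Comparing this value for a stemmed and a non-stemmed version of the same corpus is the kind of experiment reported by Binkley et al. (2015).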

Additionally, we found that even though some papers used the same type of data, they pre-processed data differently since they had different goals and applied different techniques. For example, Ye et al. (2017), Barua et al. (2014) and Chen et al. (2019) used developer communication (Q&A posts as documents). Ye et al. (2017) and Barua et al. (2014) removed stop words, code snippets and HTML tags, while Barua et al. (2014) also stemmed words. On the other hand, Chen et al. (2019) removed stop words and the least and the most frequent words, and identified bi-grams. Some studies considered the advice on data pre-processing from previous studies (e.g., Chen et al. 2017; Li et al. 2018), while others adopted steps that are commonly used in NLP, such as noise removal and stemming (Miner et al. 2012) (e.g., Demissie et al. 2020). This means that the choice of pre-processing steps does not only depend on the characteristics of the type of data inputted to topic modeling techniques.

6.4 RQ4: Assigning Names to Topics

Most papers did not mention if or how they named topics. The majority of papers that explicitly assigned names to topics (27 out of 36) used a manual approach and relied on human judgment (researchers’ interpretation) of words in clusters. One paper (Rosen and Shihab 2016 ) justified their use of a manual approach by arguing that there was no tool that could give human readable topics based on word clusters. Thus, authors checked every word cluster generated and the documents used (an individual question of a Q&A website) to make sure they would label topics appropriately.

Table  17 shows how topics were named and the type of data analyzed. Table  18 shows how topics were named and the type of contributions they make. We observed the following:

Studies that modeled topics from developer documentation, transcripts and URLs did not mention topic naming. Studies that contributed with both exploration and comparison also did not mention topic naming;

Topics were mostly named in studies that used data from developer communication (ten occurrences) and in exploratory studies (22 occurrences).

From studies that compared topic models or topic modeling-based approaches (see Section  6.1.2 ), only one study (Yan et al. 2016b ) named topics (automatically with predefined labels).

Fourteen papers acknowledged limitations of manual topic naming:

Twelve papers (Bagherzadeh and Khatchadourian 2019 ; Ahmed and Bagherzadeh 2018 ; Martin et al. 2015 ; Hindle et al. 2013 ; Pagano and Maalej 2013 ; Zou et al. 2017 ; Pettinato et al. 2019 ; Layman et al. 2016 ; Ray et al. 2014 ; Tiarks and Maalej 2014 ; Mezouar et al. 2018 ; Abdellatif et al. 2020 ) acknowledged that how topics were named could be a threat to validity. For example, Layman et al. ( 2016 ) mentioned that they did not evaluate the accuracy of the manual topic naming, which was based on their expertise.

Three papers (Hindle et al. 2015 ; Bajracharya and Lopes 2012 ; Li et al. 2018 ) mentioned difficulties in assigning names to topics. Hindle et al. ( 2015 ), for example, explained that labeling topics was difficult due to many project-specific and unclear terms in clusters.

One paper (Pettinato et al. 2019 ) acknowledged that another topic naming approach could be applied to their data: an automated extraction of topic names could replace manual labeling.

Hindle et al. ( 2015 ) provided some recommendations on topic analysis in software engineering based on their experiences. Below are some of their recommendations related to topic naming:

Some of the generated topics will not be relevant (e.g., clusters filled with common terms may not address any particular subject) and topics may be duplicated. This means that not all topics have to be named and used for analysis;

Domain experts can label topics better than non-experts, because they are more familiar with domain-specific keywords that may appear in word clusters;

It is important to rely on the relationship between topics generated and the original data. Hindle et al. ( 2015 ) argued that “the content of the topic can be interpreted in many different ways and LDA does not look for the same patterns that people do”.

6.5 Implications

The goal of this study was to describe how topic modeling is applied in software engineering research. We found studies that experimented, explored data, or proposed solutions to support different software engineering tasks with topic models. Our findings help researchers and practitioners as follows:

Understand which topic modeling techniques to use for what purpose . Researchers and practitioners who are going to select and apply a topic modeling technique (for example, to refactor legacy systems) may consider the experiences of other studies with similar objectives.

Pre-processing based on the type of data to be modeled . Pre-processing steps depend on the type of data analyzed (e.g., removing HTML tags in developer communication, mainly Q&A posts). Researchers and practitioners who, for example, intend to model topics from source code may consider the same pre-processing steps that other studies applied to source code.

Understand how to name topics . Researchers and practitioners may check how other studies named topics to get insights on how to give meaning to their own topics.

We present some additional insights:

Appropriateness of topic modeling . Although we found that most papers applied LDA “as is”, it may not be the best approach for other studies or for practical application. LDA is popular because it is an unsupervised model, i.e., it does not require previous knowledge about the data (e.g., pre-defined classes for model training), it is statistically more rigorous than other techniques (e.g., LSI), and it discovers latent relationships (i.e., topics) between documents in a large textual corpus (Griffiths and Steyvers 2004 ). However, LDA is an unstable and non-deterministic model. This means that generated topics cannot necessarily be replicated by others, even if the same model inputs (data pre-processing and configuration of parameters) are used. Furthermore, LDA performs poorly with short documents (Lin et al. 2014 ).

Meaningful topics . Topic models should discover semantically meaningful topics. Chang et al. ( 2009 ) argue about the importance of the interpretability of topics generated by probabilistic topic modeling techniques such as LDA. To create meaningful and replicable topics with LDA, Mantyla et al. ( 2018 ) highlight the importance of stabilizing the topic model (e.g., through tuning (Agrawal et al. 2018 )) and advocate the use of stability metrics (e.g., rank-biased overlap - RBO (Mantyla et al. 2018 )).

Research opportunities . Researchers interested in investigating topic modeling in software engineering may consider developing guidelines for researchers on how to use topic modeling, depending on the type of data, goals, etc. Further studies may also explore approaches for naming topics (e.g., based on domain experts), the evaluation of the semantic accuracy of generated topics (e.g., how meaningful the topics are and whether the context of documents has to be considered), and metrics to measure the performance of topic models supporting different software engineering tasks.
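The stability metric mentioned under "Meaningful topics" above, rank-biased overlap (RBO), compares two ranked lists and gives more weight to agreement near the top of the rankings. The following is a minimal sketch of the extrapolated form of RBO for two equally long rankings, for instance the top words of a topic obtained from two LDA runs. The function name rank_biased_overlap and the default persistence parameter p = 0.9 are assumptions for illustration; the sketch does not reproduce the clustering of replicated runs described by Mantyla et al. ( 2018 ).

```python
def rank_biased_overlap(ranking_a, ranking_b, p=0.9):
    """Extrapolated RBO of two equally long rankings without duplicate items.

    Returns a value between 0 (disjoint rankings) and 1 (identical rankings).
    """
    if len(ranking_a) != len(ranking_b) or not ranking_a:
        raise ValueError("this sketch assumes non-empty rankings of equal length")
    if not 0 < p < 1:
        raise ValueError("p must lie strictly between 0 and 1")

    depth = len(ranking_a)
    seen_a, seen_b = set(), set()
    weighted_sum = 0.0
    agreement = 0.0
    for d in range(1, depth + 1):
        seen_a.add(ranking_a[d - 1])
        seen_b.add(ranking_b[d - 1])
        # Agreement at depth d: share of items common to both top-d prefixes.
        agreement = len(seen_a & seen_b) / d
        weighted_sum += agreement * p ** d
    # Extrapolation: assume the agreement seen at the final depth continues.
    return agreement * p ** depth + (1 - p) / p * weighted_sum


# Hypothetical top words of the "same" topic in two runs.
run_1 = ["bug", "crash", "error", "fix", "null"]
run_2 = ["bug", "error", "crash", "null", "patch"]
score = rank_biased_overlap(run_1, run_2)
```

A lower p concentrates the weight on the first few ranks, so the choice of p should reflect how many top words are actually inspected when interpreting a topic.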

6.6 Threats to Validity

We analysed the validity threats to our study considering four types of threats to validity in systematic literature mapping studies (Petersen et al. 2015 ):

Theoretical validity. This threat to validity refers to concerns related to capturing the data as intended, i.e., bias and limitations in the data selection and extraction. As we focused on the practice of topic modeling in software engineering, we restricted the search to highly ranked software engineering venues, which generally publish more mature studies. We used “topic model”, “topic model[l]ing”, “lsi”, “lda”, “plsi”, “latent dirichlet allocation”, “latent semantic” as search keywords to find all papers related to topic modeling. To select papers for the survey, we established inclusion and exclusion criteria. One author selected the papers and the others checked whether the selection criteria were applied appropriately. Furthermore, to minimize this threat in relation to data extraction, we first defined the data items (details are in Table 2 ) to be extracted from papers and the relevance of the data for each research question. Then, one author extracted the data and the others reviewed the results. Controversial data results were discussed to reach agreement.

Descriptive validity. In the context of a literature survey, descriptive validity refers to bias and limitations in data synthesis and the accurate and objective description of the data. To mitigate this threat, we described in detail how the data was synthesized (see Section 4.3 ); furthermore, one of the authors synthesized the data and the others reviewed the results. Still, data and results depend on what is reported in papers, which was sometimes incomplete, inconsistent or inaccurate (see, for example, information about document length).

Interpretive validity. This threat to validity refers to bias and limitations in the results of the data analysis. We frequently reviewed the synthesized data during the data analysis, and the authors with more experience in this type of study checked the results for inconsistencies. Still, we recognize that interpretation bias may not have been removed completely.

Repeatability. This threat to validity concerns whether the study and its results can be replicated. To reduce this threat, we described in detail our search procedures (Section 4 ) and the processes of data selection, extraction and synthesis. We also followed general guidelines for systematic literature reviews as suggested by Kitchenham ( 2004 ) and the mapping study method as suggested by Petersen et al. ( 2015 ). Furthermore, raw data of our study are available online 1 .
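The keyword search described under theoretical validity could be approximated on a list of paper titles or abstracts as sketched below. This is an illustration only and not the search procedure used in the study: the names PHRASES, ACRONYMS and matches_search are hypothetical, and the matching rules (phrases match as word prefixes so that "topic models" and "topic modelling" are found, acronyms match only as whole words) are assumptions.

```python
import re

# Search keywords from the text; "topic model" as a prefix also covers
# "topic modeling" and "topic modelling".
PHRASES = ["topic model", "latent dirichlet allocation", "latent semantic"]
ACRONYMS = ["lsi", "lda", "plsi"]

# Phrases: allow any whitespace between words and any suffix on the last word.
_phrase_alternatives = "|".join(r"\s+".join(p.split()) for p in PHRASES)
# Acronyms: require word boundaries on both sides (so "LDAP" does not match).
_acronym_alternatives = "|".join(ACRONYMS)

KEYWORD_PATTERN = re.compile(
    rf"\b(?:{_phrase_alternatives})|\b(?:{_acronym_alternatives})\b",
    re.IGNORECASE,
)


def matches_search(text):
    """Return True if the text contains at least one search keyword."""
    return KEYWORD_PATTERN.search(text) is not None


titles = [
    "Bug localization using latent Dirichlet allocation",
    "Configuring LDAP servers",
]
selected = [t for t in titles if matches_search(t)]
```

A filter like this only produces candidates; the inclusion and exclusion criteria still have to be applied manually to each candidate paper.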

7 Conclusions

We analyzed 111 papers that applied topic modeling. These papers were published in the last twelve years (2009-2020) in ten highly ranked software engineering venues (five conferences and five journals). Below we summarize our findings:

LDA and LDA-based techniques are the most frequently used topic modeling techniques;

Topic modeling was mostly used to develop techniques for handling bugs (e.g., to predict defects). Exploratory studies that use topic modeling as a data analysis technique were also frequent;

Most papers modeled topics from source code (using methods as documents);

Most papers used LDA “as is” and without adapting values of hyperparameters ( α and β );

Most papers describe pre-processing. Some pre-processing steps depend on the type of textual data used (e.g., removal of URL and HTML tags), while others are commonly used in NLP techniques (e.g., stop words removal or stemming);

Only 36 (out of 111) papers named the topics. When naming topics, papers mostly adopted manual topic naming approaches such as deducing names (or labeling pre-defined names) based on the meaning of frequent words in that topic.

By analysing topic modeling techniques, data inputs, data pre-processing, and how topics were named, we identified characteristics and limitations in the use of topic models. Our study can provide insights and references to researchers and practitioners to make the best use of topic modeling, considering the experiences from previous studies.

Our study did not investigate all potential characteristics of topic modeling in software engineering or compare topic models to other text mining techniques. To answer our research questions, we analyzed data items shown in Table 2 . Future studies may investigate other characteristics of the use of topic modeling in software engineering, for example, topic modeling tools or libraries (e.g., Mallet) used; the context of a specific supported software engineering task; or compare topic modeling techniques to other text mining techniques, such as clustering and summarization (e.g., sentence or document embeddings). Furthermore, future work can reflect on other fields or uses of topic modeling to contrast how topic modeling is applied in software engineering. Further studies may also investigate how papers evaluate the performance of their topic modeling techniques, how papers evaluate the quality of the generated topics, and how exactly word clusters were used when topics were not named.

https://doi.org/10.5281/zenodo.5280890

This table also shows hyperparameters and the number of topics which are discussed in the following subsection.

http://mallet.cs.umass.edu/topics.php

https://nlp.stanford.edu/software/tmt/tmt-0.4/

http://www.ai.mit.edu/projects/jmlr/papers/volume5/lewis04a/a11-smart-stop-list/english.stop

https://gist.github.com/sebleier/554280

https://github.com/mengjunxie/ae-lda/blob/master/misc/mallet-stopwords-en.txt

http://icon.shef.ac.uk/Moby/mwords.html

https://tartarus.org/martin/PorterStemmer/

https://pypi.org/project/krovetz/

Abdellatif A, Costa D, Badran K, Abdalkareem R, Shihab E (2020) Challenges in Chatbot Development: A Study of Stack Overflow Posts. In: Proceedings of the 17th international conference on mining software repositories. https://doi.org/10.1145/3379597.3387472 , vol 12. IEEE/ACM, Seoul, pp 174–185

Abdellatif TM, Capretz LF, Ho D (2019) Automatic recall of software lessons learned for software project managers. Inf Softw Technol 115:44–57. https://doi.org/10.1016/j.infsof.2019.07.006


Aggarwal CC, Zhai C (2012) Mining text data. Springer, New York. https://doi.org/10.1007/978-1-4614-3223-4


Agrawal A, Fu W, Menzies T (2018) What is wrong with topic modeling? And how to fix it using search-based software engineering. Inf Softw Technol 98(January 2017):74–88. https://doi.org/10.1016/j.infsof.2018.02.005

Ahasanuzzaman M, Asaduzzaman M, Roy CK, Schneider KA (2019) CAPS: a supervised technique for classifying Stack Overflow posts concerning API issues. Empir Softw Eng 25:1493–1532. https://doi.org/10.1007/s10664-019-09743-4

Ahmed S, Bagherzadeh M (2018) What do concurrency developers ask about?: A large-scale study using Stack Overflow. In: Proceedings of the international symposium on empirical software engineering and measurement. https://doi.org/10.1145/3239235.3239524 . ACM, Oulu, pp 1–10

Ali N, Sharafi Z, Guéhéneuc Y G, Antoniol G (2015) An empirical study on the importance of source code entities for requirements traceability. Empir Softw Eng 20(2):442–478. https://doi.org/10.1007/s10664-014-9315-y

Alipour A, Hindle A, Stroulia E (2013) A contextual approach towards more accurate duplicate bug report detection. In: IEEE international working conference on mining software repositories. pp 183–192. https://doi.org/10.1109/MSR.2013.662402

Altarawy D, Shahin H, Mohammed A, Meng N (2018) LASCAD: Language-agnostic software categorization and similar application detection. J Syst Softw 142:21–34. https://doi.org/10.1016/j.jss.2018.04.018

ARC (2012) Excellence in Research for Australia (ERA). https://www.arc.gov.au/excellence-research-australia , http://www.arc.gov.au/pdf/era12/ERAFactsheet_Jan2012_1.pdf

Asuncion HU, Asuncion AU, Taylor RN (2010) Software traceability with topic modeling. In: Proceedings of the international conference on software engineering. IEEE/ACM, Cape Town, pp 95–104

Bagherzadeh M, Khatchadourian R (2019) Going big: a large-scale study on what big data developers ask. In: Proceedings of the 27th joint european software engineering conference and symposium on the foundations of software engineering. https://doi.org/10.1145/3338906.3338939 . ACM, Tallinn, pp 432–442

Bajaj K, Pattabiraman K, Mesbah A (2014) Mining questions asked by web developers. In: Proceedings of the 11th working conference on mining software repositories. https://doi.org/10.1145/2597073.2597083 . ACM, Hyderabad, pp 112–121

Bajracharya S, Lopes C (2009) Mining search topics from a code search engine usage log. In: Proceedings of the 6th international working conference on mining software repositories. https://doi.org/10.1109/MSR.2009.5069489 . IEEE, Vancouver, pp 111–120

Bajracharya SK, Lopes CV (2012) Analyzing and mining a code search engine usage log. Empir Softw Eng 17:424–466. https://doi.org/10.1007/s10664-010-9144-6

Barua A, Thomas SW, Hassan AE (2014) What are developers talking about? An analysis of topics and trends in Stack Overflow. Empir Softw Eng 19(3):619–654. https://doi.org/10.1007/s10664-012-9231-y

Bavota G, Gethers M, Oliveto R, Poshyvanyk D, De Lucia A (2014a) Improving software modularization via automated analysis of latent. ACM Trans Softw Eng Methodol 23(1):1–33. https://doi.org/10.1145/2559935

Bavota G, Oliveto R, Gethers M, Poshyvanyk D, De Lucia A (2014b) Methodbook: Recommending move method refactorings via relational topic models. IEEE Trans Softw Eng 40(7):671–694. https://doi.org/10.1109/TSE.2013.60

Beitzel SM, Jensen EC, Frieder O (2009) MAP. In: Encyclopedia of database systems. https://doi.org/10.1007/978-0-387-39940-9_492 . Springer US, Boston, pp 1691–1692

Belle AB, Boussaidi GE, Kpodjedo S (2016) Combining lexical and structural information to reconstruct software layers. Inf Softw Technol 74:1–16. https://doi.org/10.1016/j.infsof.2016.01.008

Bi T, Liang P, Tang A, Yang C (2018) A systematic mapping study on text analysis techniques in software architecture. J Syst Softw 144:533–558. https://doi.org/10.1016/j.jss.2018.07.055

Biggers LR, Bocovich C, Capshaw R, Eddy BP, Etzkorn LH, Kraft NA (2014) Configuring latent Dirichlet allocation based feature location. Empir Softw Eng 19(3):465–500. https://doi.org/10.1007/s10664-012-9224-x

Binkley D, Lawrie D, Uehlinger C, Heinz D (2015) Enabling improved IR-based feature location. J Syst Softw 101:30–42. https://doi.org/10.1016/j.jss.2014.11.013

Blasco D, Cetina C, Pastor O (2020) A fine-grained requirement traceability evolutionary algorithm: Kromaia, a commercial video game case study. Inf Softw Technol 119:1–12. https://doi.org/10.1016/j.infsof.2019.106235

Blei DM, Jordan MI, Griffiths TL, Tenenbaum JB (2003a) Hierarchical topic models and the nested chinese restaurant process. In: Proceedings of the 16th international conference on neural information processing systems. Neural Information Processing Systems Foundation, Vancouver, pp 17–24

Blei DM, Ng AY, Jordan MI (2003b) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022. https://doi.org/10.1162/jmlr.2003.3.4-5.993


Brank J, Mladenić D, Grobelnik M, Liu H, Mladenić D, Flach PA, Garriga GC, Toivonen H, Toivonen H (2011) F 1-measure. In: Encyclopedia of machine learning. https://doi.org/10.1007/978-0-387-30164-8_298 . Springer US, pp 397–397

Canfora G, Cerulo L, Cimitile M, Di Penta M (2014) How changes affect software entropy: An empirical study. Empir Softw Eng 19:1–38. https://doi.org/10.1007/s10664-012-9214-z

Cao B, Frank Liu X, Liu J, Tang M (2017) Domain-aware Mashup service clustering based on LDA topic model from multiple data sources. Inf Softw Technol 90:40–54. https://doi.org/10.1016/j.infsof.2017.05.001

Capiluppi A, Ruscio DD, Rocco JD, Nguyen PT, Ajienka N (2020) Detecting Java software similarities by using different clustering techniques. Inf Softw Technol 122. https://doi.org/10.1016/j.infsof.2020.106279

Catolino G, Palomba F, Zaidman A, Ferrucci F (2019) Not all bugs are the same: Understanding, characterizing, and classifying bug types. J Syst Softw 152:165–181. https://doi.org/10.1016/j.jss.2019.03.002

Chang J, Blei DM (2009) Relational topic models for document networks. In: Proceedings of the 12th international conference on artificial intelligence and statistics. Society for Artificial Intelligence and Statistics, Clearwater Beach, pp 81–88

Chang J, Blei DM (2010) Hierarchical relational models for document networks. Ann Appl Stat 4(1):124–150. https://doi.org/10.1214/09-AOAS309


Chang J, Boyd-Graber J, Gerrish S, Wang C, Blei DM (2009) Reading tea leaves: How humans interpret topic models. In: Proceedings of the 2009 conference advances in neural information. Neural Information Processing Systems Foundation, Vancouver, pp 288–296

Chatterjee P, Damevski K, Pollock L (2019) Exploratory study of slack q&a chats as a mining source for software engineering tools. In: Proceedings of the 16th international conference on mining software repositories. IEEE, Montreal, pp 1–12

Chen H, Coogle J, Damevski K (2019) Modeling stack overflow tags and topics as a hierarchy of concepts. J Syst Softw 156:283–299. https://doi.org/10.1016/j.jss.2019.07.033

Chen L, Hassan F, Wang X, Zhang L (2020) Taming behavioral backward incompatibilities via cross-project testing and analysis. In: Proceedings of the 42nd international conference on software engineering. https://doi.org/10.1145/3377811.3380436 . IEEE/ACM, Seoul, pp 112–124

Chen N, Lin J, Hoi SC, Xiao X, Zhang B (2014) AR-miner: Mining informative reviews for developers from mobile app marketplace. In: Proceedings of the international conference on software engineering. https://doi.org/10.1145/2568225.2568263 , vol 1. IEEE/ACM, Hyderabad, pp 767–778

Chen TH, Thomas SW, Nagappan M, Hassan AE (2012) Explaining software defects using topic models. In: Proceedings of the international working conference on mining software repositories. https://doi.org/10.1109/MSR.2012.6224280 . IEEE, Zurich, pp 189–198

Chen TH, Thomas SW, Hassan AE (2016) A survey on the use of topic models when mining software repositories. Empir Softw Eng 21(5):1843–1919. https://doi.org/10.1007/s10664-015-9402-8

Chen TH, Shang W, Nagappan M, Hassan AE, Thomas SW (2017) Topic-based software defect explanation. J Syst Softw 129:79–106. https://doi.org/10.1016/j.jss.2016.05.015

Choetkiertikul M, Dam HK, Tran T, Ghose A (2017) Predicting the delay of issues with due dates in software projects. Empir Softw Eng 22:1223–1263. https://doi.org/10.1007/s10664-016-9496-7

Craswell N (2009) Mean reciprocal rank. In: Encyclopedia of database systems. https://doi.org/10.1007/978-0-387-39940-9_488 . Springer US, pp 1703–1703

Croft WB, Metzler D (2010) Search engines: Information retrieval in practice. Addison-Wesley, Reading


Cui D, Liu T, Cai Y, Zheng Q, Feng Q, Jin W, Guo J, Qu Y (2019) Investigating the impact of multiple dependency structures on software defects, IEEE/ACM, Montreal. https://doi.org/10.1109/ICSE.2019.00069

Damevski K, Chen H, Shepherd DC, Kraft NA, Pollock L (2018) Predicting future developer behavior in the IDE using topic models. IEEE Trans Softw Eng 44(11):1100–1111. https://doi.org/10.1109/TSE.2017.2748134

De Lucia A, Di Penta M, Oliveto R, Panichella A, Panichella S (2014) Labeling source code with information retrieval methods: An empirical study. Empir Softw Eng 19(5):1383–1420. https://doi.org/10.1007/s10664-013-9285-5

Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R (1990) Indexing by latent semantic analysis. J Am Soc Inf Sci 41(6):391–407. https://doi.org/10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9

Demissie BF, Ceccato M, Shar LK (2020) Security analysis of permission re-delegation vulnerabilities in Android apps. Empir Softw Eng 25:5084–5136. https://doi.org/10.1007/s10664-020-09879-8

Dietz L, Bickel S, Scheffer T (2007) Unsupervised prediction of citation influences. In: Proceedings of the 24th international conference on machine learning. https://doi.org/10.1145/1273496.1273526 . ACM, Corvallis, pp 233–240

Dit B, Revelle M, Poshyvanyk D (2013) Integrating information retrieval, execution and link analysis algorithms to improve feature location in software. Empir Softw Eng 18(2):277–309. https://doi.org/10.1007/s10664-011-9194-4

El Zarif O, Da Costa DA, Hassan S, Zou Y (2020) On the relationship between user churn and software issues. In: Proceedings of the 17th international conference on mining software repositories. https://doi.org/10.1145/3379597.3387456 . ACM, New York, pp 339–349

Fowkes J, Chanthirasegaran P, Ranca R, Allamanis M, Lapata M, Sutton C (2016) Autofolding for source code summarization. Proc Int Conf Softw Eng 43(12):649–652. https://doi.org/10.1145/2889160.2889171

Fu Y, Yan M, Zhang X, Xu L, Yang D, Kymer JD (2015) Automated classification of software change messages by semi-supervised Latent Dirichlet Allocation. Inf Softw Technol 57:369–377. https://doi.org/10.1016/j.infsof.2014.05.017

Galvis Carreno LV, Winbladh K (2012) Analysis of user comments: an approach for software requirements evolution. In: Proceedings of the international conference on software engineering. IEEE/ACM, San Francisco, pp 582–591

Gao C, Zeng J, Lyu MR, King I (2018) Online app review analysis for identifying emerging issues. In: Proceedings of the 40th international conference on software engineering. https://doi.org/10.1145/3180155.3180218 . IEEE/ACM, Gothenburg, pp 48–58

Gopalakrishnan R, Sharma P, Mirakhorli M, Galster M (2017) Can latent topics in source code predict missing architectural tactics?. In: Proceedings of the 39th international conference on software engineering, IEEE/ACM, pp 15–26. https://doi.org/10.1109/ICSE.2017.10 . http://ghtorrent.org/

Gorla A, Tavecchia I, Gross F, Zeller A (2014) Checking app behavior against app descriptions. In: Proceedings of the international conference on software engineering. https://doi.org/10.1145/2568225.2568276 . IEEE/ACM, Hyderabad, pp 1025–1035

Griffiths TL, Steyvers M (2004) Finding scientific topics. In: Proceedings of the national academy of sciences. https://doi.org/10.1073/pnas.0307752101 , vol 101. Neural Information Processing Systems Foundation, Irvine, pp 5228–5235

Haghighi A, Vanderwende L (2009) Exploring content models for multi-document summarization. In: Proceedings of the conference on human language technologies: the 2009 annual conference of the north american chapter of the association for computational linguistics. https://doi.org/10.3115/1620754.1620807 , http://www-nlpir.nist.gov/projects/duc/data.html . Association for Computational Linguistics, Boulder, pp 362–370

Han J, Shihab E, Wan Z, Deng S, Xia X (2020) What do programmers discuss about deep learning frameworks. Empir Softw Eng 25:2694–2747. https://doi.org/10.1007/s10664-020-09819-6

Haque MU, Ali Babar M (2020) Challenges in docker development: a large-scale study using stack overflow. In: Proceedings of the 14th international symposium on empirical software engineering and measurement. https://doi.org/10.1145/3382494.3410693 . IEEE/ACM, Bari, pp 1–11

Hariri N, Castro-Herrera C, Mirakhorli M, Cleland-Huang J, Mobasher B (2013) Supporting domain analysis through mining and recommending features from online product listings. IEEE Trans Softw Eng 39(12):1736–1752. https://doi.org/10.1109/TSE.2013.39

Henß S, Monperrus M, Mezini M (2012) Semi-automatically extracting FAQs to improve accessibility of software development knowledge. In: Proceedings of the international conference on software engineering. https://doi.org/10.1109/ICSE.2012.6227139 . IEEE/ACM, Zurich, pp 793–803

Hindle A, Godfrey MW, Ernst NA, Mylopoulos J (2011) Automated topic naming to support cross-project analysis of software maintenance activities. In: Proceedings of the 33rd international conference on software engineering. ACM, Waikiki, pp 163–172

Hindle A, Ernst NA, Godfrey MW, Mylopoulos J (2013) Automated topic naming: Supporting cross-project analysis of software maintenance activities. Empir Softw Eng 18(6):1125–1155. https://doi.org/10.1007/s10664-012-9209-9

Hindle A, Bird C, Zimmermann T, Nagappan N (2015) Do topics make sense to managers and developers? Empir Softw Eng 20:479–515. https://doi.org/10.1007/s10664-014-9312-1

Hindle A, Alipour A, Stroulia E (2016) A contextual approach towards more accurate duplicate bug report detection and ranking. Empir Softw Eng 21(2):368–410. https://doi.org/10.1007/s10664-015-9387-3

Hoffman M, Blei D, Bach F (2010) Online learning for latent dirichlet allocation. In: Proceedings of the neural information processing systems conference. https://doi.org/10.1.1.187.1883. Neural Information Processing Systems Foundation, Vancouver, pp 1–9

Hofmann T (1999) Probabilistic latent semantic indexing. In: Proceedings of the 22nd annual international conference on research and development in information retrieval. ACM, Berkeley, pp 50–57

Hu H, Bezemer CP, Hassan AE (2018) Studying the consistency of star ratings and the complaints in 1 & 2-star user reviews for top free cross-platform Android and iOS apps. Empir Softw Eng 23(6):3442–3475. https://doi.org/10.1007/s10664-018-9604-y

Hu H, Wang S, Bezemer CP, Hassan AE (2019) Studying the consistency of star ratings and reviews of popular free hybrid Android and iOS apps. Empir Softw Eng 24:7–32. https://doi.org/10.1007/s10664-018-9617-6

Hu W, Wong K (2013) Using citation influence to predict software defects. In: Proceedings of the international working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6624058 . IEEE, San Francisco, pp 419–428

Jiang H, Zhang J, Ren Z, Zhang T (2017) An unsupervised approach for discovering relevant tutorial fragments for APIs. In: Proceedings of the 39th international conference on software engineering. https://doi.org/10.1109/ICSE.2017.12 . IEEE/ACM, Buenos Aires, pp 38–48

Jiang HE, Zhang J, Li X, Ren Z, Lo D, Wu X, Luo Z (2019) Recommending new features from mobile app descriptions. ACM Trans Softw Eng Methodol 28(4):1–29. https://doi.org/10.1145/3344158

Jipeng Q, Zhenyu Q, Yun L, Yunhao Y, Xindong W (2020) Short text topic modeling techniques, applications, and performance: a survey. https://doi.org/10.1109/TKDE.2020.2992485

Jo Y, Oh A (2011) Aspect and sentiment unification model for online review analysis. In: Proceedings of the fourth ACM international conference on Web search and data mining. https://doi.org/10.1145/1935826 . ACM, New York, pp 815–824

Jones JA, Harrold MJ (2005) Empirical evaluation of the tarantula automatic fault-localization technique. In: Proceedings of the 20th international conference on automated software engineering. https://doi.org/10.1145/1101908.1101949 , http://portal.acm.org/citation.cfm?doid=1101908.1101949 . IEEE/ACM, New York, pp 273–282

Kakas AC, Cohn D, Dasgupta S, Barto AG, Carpenter GA, Grossberg S, Webb GI, Dorigo M, Birattari M, Toivonen H, Timmis J, Branke J, Toivonen H, Strehl AL, Drummond C, Coates A, Abbeel P, Ng AY, Zheng F, Webb GI, Tadepalli P (2011) Area under curve. In: Encyclopedia of machine learning. https://doi.org/10.1007/978-0-387-30164-8_28 . Springer US, pp 40–40

Kitchenham BA (2004) Procedures for performing systematic reviews. Keele, UK, Keele University 33(TR/SE-0401):28. https://doi.org/10.1.1.122.3308

Layman L, Nikora AP, Meek J, Menzies T (2016) Topic modeling of NASA space system problem reports research in practice. In: Proceedings of the 13th working conference on mining software repositories. https://doi.org/10.1145/2901739.2901760 . ACM, Austin, pp 303–314

Le TDB, Thung F, Lo D (2017) Will this localization tool be effective for this bug? Mitigating the impact of unreliability of information retrieval based bug localization tools. Empir Softw Eng 22:2237–2279. https://doi.org/10.1007/s10664-016-9484-y

Leach RJ (2016) Introduction to software engineering, 2nd edn. CRC Press LLC, Boca Raton. https://ebookcentral.proquest.com/lib/canterbury/detail.action?docID=4711469&query=Software+Engineering

Lee DD, Seung HS (1999) Learning the parts of objects by non-negative matrix factorization. Nature 401(6755):788–791


Li H, Chen THP, Shang W, Hassan AE (2018) Studying software logging using topic models. Empir Softw Eng 23:2655–2694. https://doi.org/10.1007/s10664-018-9595-8

Lian X, Liu W, Zhang L (2020) Assisting engineers extracting requirements on components from domain documents. Inf Softw Technol 118(September 2019):106196. https://doi.org/10.1016/j.infsof.2019.106196

Lin T, Tian W, Mei Q, Cheng H (2014) The dual-sparse topic model: Mining focused topics and focused terms in short text. In: Proceedings of the 23rd international conference on world wide web. https://doi.org/10.1145/2566486.2567980 . ACM, Seoul, pp 539–549

Liu Y, Liu L, Liu H, Wang X, Yang H (2017) Mining domain knowledge from app descriptions. J Syst Softw 133:126–144. https://doi.org/10.1016/j.jss.2017.08.024

Liu Y, Lin J, Cleland-Huang J (2020) Traceability support for multi-lingual software projects. In: Proceedings of the 17th international conference on mining software repositories. https://doi.org/10.1145/3379597.3387440 . ACM, Seoul, pp 443–454

Lukins SK, Kraft NA, Etzkorn LH (2010) Bug localization using latent Dirichlet allocation. Inf Softw Technol 52:972–990. https://doi.org/10.1016/j.infsof.2010.04.002

Luo Q, Moran K, Poshyvanyk D (2016) A large-scale empirical comparison of static and dynamic test case prioritization techniques. In: Proceedings of the 24th international symposium on foundations of software engineering. https://doi.org/10.1145/2950290.2950344 . ACM, Seattle, pp 559–570

Mahmoud A, Bradshaw G (2017) Semantic topic models for source code analysis. Empir Softw Eng 22(4):1965–2000. https://doi.org/10.1007/s10664-016-9473-1

Mann HB, Whitney DR (1947) On a test of whether one of two random variables is stochastically larger than the other. Ann Math Stat 18(1):50–60. https://doi.org/10.1214/aoms/1177730491 , http://projecteuclid.org/euclid.aoms/1177730491

Manning CD, Raghavan P, Schütze H (2008) Evaluation of Clustering. In: Introduction to information retrieval. chap 16, https://doi.org/10.33899/csmj.2008.163987 . https://nlp.stanford.edu/IR-book/html/htmledition/evaluation-of-clustering-1.html . Cambridge University Press

Mantyla MV, Claes M, Farooq U (2018) Measuring LDA topic stability from clusters of replicated runs, ACM, Oulu. https://doi.org/10.1145/3239235.3267435

Martin W, Harman M, Jia Y, Sarro F, Zhang Y (2015) The app sampling problem for app store mining. In: Proceedings of the 12th international working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.19 . IEEE, Florence, pp 123–133

Martin W, Sarro F, Harman M (2016) Causal impact analysis for app releases in google play. In: Proceedings of the 24th international symposium on foundations of software engineering. https://doi.org/10.1145/2950290.2950320 . ACM, Seattle, pp 435–446

McIlroy S, Ali N, Khalid H, Hassan AE (2016) Analyzing and automatically labelling the types of user issues that are raised in mobile app reviews. Empir Softw Eng 21:1067–1106. https://doi.org/10.1007/s10664-015-9375-7

Mehrotra R, Sanner S, Buntine W, Xie L (2013) Improving LDA Topic Models for Microblogs via Tweet Pooling and Automatic Labeling. In: Proceedings of the 36th International Conference on Research and Development in Information Retrieval. ACM, Dublin, pp 889–892

Mezouar ME, Zhang F, Zou Y (2018) Are tweets useful in the bug fixing process? An empirical study on Firefox and Chrome. Empir Softw Eng 23 (3):1704–1742. https://doi.org/10.1007/s10664-017-9559-4

Miner G, Elder J, Fast A, Hill T, Nisbet R, Delen D (2012) Practical text mining and statistical analysis for non-structured text data applications. Elsevier Science & Technology, Waltham . https://doi.org/10.1016/C2010-0-66188-8

Moslehi P, Adams B, Rilling J (2016) On mining crowd-based speech documentation. In: Proceedings of the 13th working conference on mining software repositories. https://doi.org/10.1145/2901739.2901771 . ACM, Austin, pp 259–268

Moslehi P, Adams B, Rilling J (2018) Feature location using crowd-based screencasts. In: Proceedings of the 15th international conference on mining software repositories. https://doi.org/10.1145/3196398.3196439 . ACM, New York, pp 192–202

Moslehi P, Adams B, Rilling J (2020) A feature location approach for mapping application features extracted from crowd-based screencasts to source code. Empir Softw Eng 25:4873–4926. https://doi.org/10.1007/s10664-020-09874-z

Murali V, Chaudhuri S, Jermaine C (2017) Bayesian specification learning for finding API usage errors. In: Proceedings of the Joint european software engineering conference and symposium on the foundations of software engineering. https://doi.org/10.1145/3106237.3106284 . ACM, Paderborn, pp 151–162

Nabli H, Ben Djemaa R, Ben Amor IA (2018) Efficient cloud service discovery approach based on LDA topic modeling. J Syst Softw 146:233–248. https://doi.org/10.1016/j.jss.2018.09.069

Naguib H, Narayan N, Brügge B, Helal D (2013) Bug report assignee recommendation using activity profiles. In: Proceedings of the international working conference on mining software repositories. https://doi.org/10.1109/MSR.2013.6623999 . IEEE, San Francisco, pp 22–30

Nayebi M, Cho H, Ruhe G (2018) App store mining is not enough for app improvement. Empir Softw Eng 23:2764–2794. https://doi.org/10.1007/s10664-018-9601-1

Nguyen AT, Nguyen TT, Al-Kofahi J, Nguyen HV, Nguyen TN (2011) A topic-based approach for narrowing the search space of buggy files from a bug report. In: Proceedings of the 26th international conference on automated software engineering. https://doi.org/10.1109/ASE.2011.6100062 . IEEE/ACM, Lawrence, pp 263–272

Nguyen AT, Nguyen TT, Nguyen TN, Lo D, Sun C (2012) Duplicate bug report detection with a combination of information retrieval and topic modeling. In: Proceedings of the 27th international conference on automated software engineering. https://doi.org/10.1145/2351676.2351687 . IEEE/ACM, Essen, pp 70–79

Nguyen VA, Boyd-Graber J, Resnik P, Chang J, Graber JB (2014) Learning a concept hierarchy from multi-labeled documents. In: Proceedings of the neural information processing systems conference. Neural Information Processing Systems Foundation, Montreal, pp 1–9

Noei E, Heydarnoori A (2016) EXAF: A search engine for sample applications of object-oriented framework-provided concepts. Inf Softw Technol 75:135–147. https://doi.org/10.1016/j.infsof.2016.03.007

Noei E, Da Costa DA, Zou Y (2018) Winning the app production rally. In: Proceedings of the 26th ACM joint meeting on european software engineering conference and symposium on the foundations of software engineering. https://doi.org/10.1145/3236024.3236044 . ACM, Lake Buena Vista, pp 283–294

Noei E, Zhang F, Wang S, Zou Y (2019) Towards prioritizing user-related issue reports of mobile applications. Empir Softw Eng 24:1964–1996. https://doi.org/10.1007/s10664-019-09684-y

Pagano D, Maalej W (2013) How do open source communities blog? Empir Softw Eng 18(6):1090–1124. https://doi.org/10.1007/s10664-012-9211-2

Palomba F, Salza P, Ciurumelea A, Panichella S, Gall H, Ferrucci F, De Lucia A (2017) Recommending and localizing change requests for mobile apps based on user reviews. In: Proceedings of the 39th international conference on software engineering. https://doi.org/10.1109/ICSE.2017.18 . IEEE/ACM, Buenos Aires, pp 106–117

Panichella A, Dit B, Oliveto R, Di Penta M, Poshynanyk D, De Lucia A (2013) How to effectively use topic models for software engineering tasks? An approach based on Genetic Algorithms. In: Proceedings of the international conference on software engineering. https://doi.org/10.1109/ICSE.2013.6606598 . IEEE/ACM, San Francisco, pp 522–531

Pérez F, Lapeṅa R, Font J, Cetina C (2018) Fragment retrieval on models for model maintenance: Applying a multi-objective perspective to an industrial case study. Inf Softw Technol 103:188–201. https://doi.org/10.1016/j.infsof.2018.06.017

Petersen K, Vakkalanka S, Kuzniarz L (2015) Guidelines for conducting systematic mapping studies in software engineering: An update. Inf Softw Technol 64(1):1–18. https://doi.org/10.1016/j.infsof.2015.03.007

Pettinato M, Gil JP, Galeas P, Russo B (2019) Log mining to re-construct system behavior: An exploratory study on a large telescope system. Inf Softw Technol 114:121–136. https://doi.org/10.1016/j.infsof.2019.06.011

Poshyvanyk D, Gueheneuc YG, Marcus A, Antoniol G, Rajlich V (2007) Feature location using probabilistic ranking of methods based on execution scenarios and information retrieval. https://doi.org/10.1109/TSE.2007.1016 . https://www.researchgate.net/publication/3189749 , vol 33, pp 420–431

Poshyvanyk D, Marcus A, Ferenc R, Gyimóthy T (2009) Using information retrieval based coupling measures for impact analysis. Empir Softw Eng 14(1):5–32. https://doi.org/10.1007/s10664-008-9088-2 , http://www.mozilla.org/

Poshyvanyk D, Gethers M, Marcus A (2012) Concept location using formal concept analysis and information retrieval. ACM Trans Softw Eng Methodol 21(4):1–34. https://doi.org/10.1145/2377656.2377660

Poursabzi-Sangdeh F, Goldstein DG, Hofman JM, Vaughan JW, Wallach H (2021) Manipulating and measuring model interpretability. In: Proceedings of the conference on human factors in computing systems. https://doi.org/10.1145/3411764.3445315 . ACM, Yokohama

Ramage D, Hall D, Nallapati R, Manning CD (2009) Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the conference on empirical methods in natural language processing. https://doi.org/10.5555/1699510.1699543 . ACL/AFNLP, Singapore, pp 248–256

Rao S, Kak A (2011) Retrieval from software libraries for bug localization: A comparative study of generic and composite text models. In: Proceedings of the international conference on software engineering. https://doi.org/10.1145/1985441.1985451 . IEEE/ACM, Waikiki, pp 43–52

Ray B, Posnett D, Filkov V, Devanbu P (2014) A large scale study of programming languages and code quality in GitHub. In: Proceedings of the symposium on the foundations of software engineering, pp 155–165. https://doi.org/10.1145/2635868.2635922

Revelle M, Gethers M, Poshyvanyk D (2011) Using structural and textual information to capture feature coupling in object-oriented software. Empir Softw Eng 16(6):773–811. https://doi.org/10.1007/s10664-011-9159-7

Röder M, Both A, Hinneburg A (2015) Exploring the space of topic coherence measures. In: Proceedings of the eighth ACM international conference on web search and data mining - WSDM ’15. https://doi.org/10.1145/2684822.2685324 . ACM, Shanghai, pp 399–408

Rosen C, Shihab E (2016) What are mobile developers asking about? A large scale study using Stack Overflow. Empir Softw Eng 21:1192–1223. https://doi.org/10.1007/s10664-015-9379-3

Rosenberg CM, Moonen L (2018) Improving problem identification via automated log clustering using dimensionality reduction. In: Proceedings of the international symposium on empirical software engineering and measurement. https://doi.org/10.1145/3239235.3239248 . ACM, Oulu, pp 1–10

Rothermel G, Untcn RH, Chu C, Harrold MJ (2001) Prioritizing test cases for regression testing. IEEE Trans Softw Eng 27(10):929–948. https://doi.org/10.1109/32.962562

Salton G, Wong A, Yang CS (1975) A vector space model for automatic indexing. Commun ACM 18(11):613–620. https://doi.org/10.1145/361219.361220

Savage T, Dit B, Gethers M, Poshyvanyk D (2010) TopicXP: exploring topics in source code using latent Dirichlet allocation. IEEE, Timisoara. https://doi.org/10.1109/ICSM.2010.5609654

Shannon CE (1948) A mathematical theory of communication. Bell Syst Tech J 27(3):379–423. https://doi.org/10.1002/j.1538-7305.1948.tb01338.x

Shimagaki J, Kamei Y, Ubayashi N, Hindle A (2018) Automatic topic classification of test cases using text mining at an android smartphone vendor. In: Proceedings of the 12th international symposium on empirical software engineering and measurement. https://doi.org/10.1145/3239235.3268927 . IEEE/ACM, Oulu, pp 1–10

Silva B, Sant’anna C, Rocha N, Chavez C (2016) The effect of automatic concern mapping strategies on conceptual cohesion measurement. Inf Softw Technol 75:56–70. https://doi.org/10.1016/j.infsof.2016.03.006

Silva LL, Valente MT, Maia MA (2019) Co-change patterns: A large scale empirical study. J Syst Softw 152:196–214. https://doi.org/10.1016/j.jss.2019.03.014

Soliman M, Galster M, Salama AR, Riebisch M (2016) Architectural knowledge for technology decisions in developer communities: An exploratory study with Stack Overflow. In: Proceedings of the 13th working conference on software architecture. https://doi.org/10.1109/WICSA.2016.13 . IEEE, Venice, pp 128–133

Somasundaram K, Murphy GC (2012) Automatic categorization of bug reports using latent Dirichlet allocation. In: Proceedings of the 5th India software engineering conference. https://doi.org/10.1145/2134254.2134276 , vol 12. ACM, pp 125–130

Souza LB, Campos EC, Madeiral F, Paixão K, Rocha AM, Maia M d A (2019) Bootstrapping cookbooks for APIs from crowd knowledge on Stack Overflow. Inf Softw Technol 111(March 2018):37–49. https://doi.org/10.1016/j.infsof.2019.03.009

Steyvers M, Griffiths T (2010) Probalistic Topic Models. In: Landauer T, McNamara D, Dennis S, Kintsch W (eds) Latent semantic analysis: a road to meaning. https://doi.org/10.1016/s0364-0213(01)00040-4 . University of California, Irvine, pp 993–1022

Sun X, Li B, Leung H, Li B, Li Y (2015) MSR4SM: Using topic models to effectively mining software repositories for software maintenance tasks. Inf Softw Technol 66:1–12. https://doi.org/10.1016/j.infsof.2015.05.003

Sun X, Liu X, Li B, Duan Y, Yang H, Hu J (2016) Exploring topic models in software engineering data analysis: A survey, IEEE, Shangai. https://doi.org/10.1109/SNPD.2016.7515925

Sun X, Yang H, Xia X, Li B (2017) Enhancing developer recommendation with supplementary information via mining historical commits. J Syst Softw 134:355–368. https://doi.org/10.1016/j.jss.2017.09.021

Taba SES, Keivanloo I, Zou Y, Wang S (2017) An exploratory study on the usage of common interface elements in android applications. J Syst Softw 131:491–504. https://doi.org/10.1016/j.jss.2016.07.010

Tairas R, Gray J (2009) An information retrieval process to aid in the analysis of code clones. https://doi.org/10.1007/s10664-008-9089-1 , http://www.cis.uab.edu/tairasr/clones/literature , vol 14, pp 33–56

Tamrawi A, Nguyen TT, Al-Kofahi JM, Nguyen TN (2011) Fuzzy set and cache-based approach for bug triaging. In: Proceedings of the 19th ACM symposium on foundations of software engineering. https://doi.org/10.1145/2025113.202516 . ACM, pp 365–375

Tang J, Zhang M, Mei Q (2013) One theme in all views: modeling consensus topics in multiple contexts. In: Proceedings of the 19th international conference on knowledge discovery and data mining. ACM, New York, pp 5–13

Tantithamthavorn C, Lemma Abebe S, Hassan AE, Ihara A, Matsumoto K (2018) The impact of IR-based classifier configuration on the performance and the effort of method-level bug localization. Inf Softw Technol 102(June):160–174. https://doi.org/10.1016/j.infsof.2018.06.001

Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat Assoc 101(476):1566–1581. https://doi.org/10.1198/016214506000000302

Thomas SW, Nagappan M, Blostein D, Hassan AE (2013) The impact of classifier configuration and classifier combination on bug localization. IEEE Trans Softw Eng 39(10):1427–1443. https://doi.org/10.1109/TSE.2013.27

Thomas SW, Hemmati H, Hassan AE, Blostein D (2014) Static test case prioritization using topic models. Empir Softw Eng 19:182–212. https://doi.org/10.1007/s10664-012-9219-7

Tiarks R, Maalej W (2014) How does a typical tutorial for mobile development look like?. In: Proceedings of the 11th international conference on mining software repositories. https://doi.org/10.1145/2597073.2597106 . IEEE/ACM, Hyderabad, pp 272–281

Treude C, Wagner M (2019) Predicting good configurations for GitHub and stack overflow topic models. In: Proceedings of the 16th international conference on mining software repositories. https://doi.org/10.1109/MSR.2019.00022 . IEEE, Montreal, pp 84–95

Vargha A, Delaney HD (2000) A critique and improvement of the CL common language effect size statistics of McGraw and Wong. J Educ Behav Stat 25(2):101–132. https://doi.org/10.3102/10769986025002101

Wallach HM, Mimno D, McCallum A (2009) Rethinking LDA: Why priors matter. In: Proceedings of the conference on advances in neural information processing systems. Curran Associates Inc., Vancouver, pp 1973–1981. http://rexa.info/

Wang C, Blei DM (2011) Collaborative topic modeling for recommending scientific articles. In: Proceedings of the international conference on knowledge discovery and data mining. https://doi.org/10.1145/2020408.2020480 . ACM, New York, pp 448–456

Wang W, Malik H, Godfrey MW (2015) Recommending posts concerning API issues in developer Q&A sites. In: Proceedings of the international working conference on mining software repositories. https://doi.org/10.1109/MSR.2015.28 . http://stackoverflow.com/questions/5358219/ . IEEE/ACM, pp 224–234

Wei X, Croft WB (2006) LDA-based document models for ad-hoc retrieval. In: Proceedings of the 29th annual international conference on research and development in information retrieval. https://doi.org/10.1145/1148170.1148204 . ACM, Seattle, pp 178–185

Weng J, Lim EP, Jiang J, He Q (2010) TwitterRank: Finding topic-sensitive influential twitterers. In: Proceedings of the 3rd international conference on web search and data mining. https://doi.org/10.1145/1718487.1718520 . ACM, New York, pp 261–270

Wold S, Esbensen K, Geladi P (1987) Principal component analysis. Chemom Intell Lab Syst 2:37–52. https://doi.org/10.1016/0169-7439(87)80084-9

Xia X, Bao L, Lo D, Kochhar PS, Hassan AE, Xing Z (2017a) What do developers search for on the web? Empir Softw Eng 22(6):3149–3185. https://doi.org/10.1007/s10664-017-9514-4

Xia X, Lo D, Ding Y, Al-Kofahi JM, Nguyen TN, Wang X (2017b) Improving automated bug triaging with specialized topic model. IEEE Trans Softw Eng 43(3):272–297. https://doi.org/10.1109/TSE.2016.2576454

Yan M, Fu Y, Zhang X, Yang D, Xu L, Kymer JD (2016a) Automatically classifying software changes via discriminative topic model: Supporting multi-category and cross-project. J Syst Softw 113:296–308. https://doi.org/10.1016/j.jss.2015.12.019

Yan M, Zhang X, Yang D, Xu L, Kymer JD (2016b) A component recommender for bug reports using Discriminative Probability Latent Semantic Analysis. Inf Softw Technol 73:37–51. https://doi.org/10.1016/j.infsof.2016.01.005

Yang X, Lo D, Li L, Xia X, Bissyandé T F, Klein J (2017) Characterizing malicious Android apps by mining topic-specific data flow signatures. Inf Softw Technol 90:27–39. https://doi.org/10.1016/j.infsof.2017.04.007

Ye D, Xing Z, Kapre N (2017) The structure and dynamics of knowledge network in domain-specific Q&A sites: a case study of stack overflow. Empir Softw Eng 22(1):375–406. https://doi.org/10.1007/s10664-016-9430-z

Zaman S, Adams B, Hassan AE (2011) Security versus performance bugs: A case study on firefox. In: Proceedings - international conference on software engineering. https://doi.org/10.1145/1985441.198545 , pp 93–102

Zeugmann T, Poupart P, Kennedy J, Jin X, Han J, Saitta L, Sebag M, Peters J, Bagnell JA, Daelemans W, Webb GI, Ting KM, Ting KM, Webb GI, Shirabad JS, Fürnkranz J, Hüllermeier E, Matwin S, Sakakibara Y, Flener P, Schmid U, Procopiuc CM, Lachiche N, Fürnkranz J (2011) Precision and recall. In: Encyclopedia of machine learning. https://doi.org/10.1007/978-0-387-30164-8_652 . Springer US, pp 781–781

Zhang E, Zhang Y (2009) Average precision. In: Encyclopedia of database systems. https://doi.org/10.1007/978-0-387-39940-9_482 . Springer US, pp 192–193

Zhang T, Chen J, Yang G, Lee B, Luo X (2016) Towards more accurate severity prediction and fixer recommendation of software bugs. J Syst Softw 117:166–184. https://doi.org/10.1016/j.jss.2016.02.034

Zhang Y, Lo D, Xia X, Scanniello G, Le TDB, Sun J (2018) Fusing multi-abstraction vector space models for concern localization. Empir Softw Eng 23:2279–2322. https://doi.org/10.1007/s10664-017-9585-2

Zhao N, Chen J, Wang Z, Peng X, Wang G, Wu Y, Zhou F, Feng Z, Nie X, Zhang W, Sui K, Pei D (2020) Real-time incident prediction for online service systems. In: Proceedings of the 28th ACM joint meeting european software engineering conference and symposium on the foundations of software engineering. https://doi.org/10.1145/3368089.3409672 , vol 20. ACM, pp 315–326

Zhao WX, Jiang J, Weng J, He J, Lim EP, Yan H, Li X (2011) Comparing twitter and traditional media using topic models. In: Lecture Notes in Computer Science. https://doi.org/10.1007/978-3-642-20161-5-34 , vol 6611. Springer, Berlin, chap Advances i, pp 338–349

Zhao Y, Zhanq F, Shlhab E, Zou Y, Hassan AE (2016) How are discussions associated with bug reworking? an empirical study on open source projects. In: Proceedings of the 10th international symposium on empirical software engineering and measurement. https://doi.org/10.1145/2961111.296259 . IEEE/ACM, Ciudad Real, pp 1–10

Zou J, Xu L, Yang M, Zhang X, Yang D (2017) Towards comprehending the non-functional requirements through Developers’ eyes: An exploration of Stack Overflow using topic analysis. Inf Softw Technol 84(1):19–32. https://doi.org/10.1016/j.infsof.2016.12.003


Acknowledgements

We would like to thank the editor and the anonymous reviewers for their insightful and detailed feedback that helped us to significantly improve the manuscript.

Author information

Authors and affiliations.

University of Canterbury, Christchurch, New Zealand

Camila Costa Silva, Matthias Galster & Fabian Gilson


Corresponding author

Correspondence to Camila Costa Silva.

Ethics declarations

Conflict of interests.

The authors declare that they have no conflict of interest.

Additional information

Communicated by: Andrea De Lucia

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A.1 Papers Reviewed

A.2 Metrics Used in Comparative Studies

The column “Context-specific” indicates if the metric was proposed or adapted to a specific context (“Yes”) or is a standard NLP metric (“No”).

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Silva, C.C., Galster, M. & Gilson, F. Topic modeling in software engineering research. Empir Software Eng 26, 120 (2021). https://doi.org/10.1007/s10664-021-10026-0

Accepted: 29 July 2021

Published: 06 September 2021

DOI: https://doi.org/10.1007/s10664-021-10026-0


  • Topic modeling
  • Text mining
  • Natural language processing
  • Literature analysis
Ten recommendations for software engineering in research

Janna Hastings

Cheminformatics and Metabolism, European Molecular Biology Laboratory – European Bioinformatics Institute, Wellcome Trust Genome Campus, CB10 1SD Hinxton, UK

Kenneth Haug

Christoph Steinbeck

Research in the context of data-driven science requires a backbone of well-written software, but scientific researchers are typically not trained at length in software engineering, the principles for creating better software products. To address this gap, in particular for young researchers new to programming, we give ten recommendations to ensure the usability, sustainability and practicality of research software.

Scientific research increasingly harnesses computing as a platform [ 1 ], and the size, complexity, diversity and relatively high availability of research datasets in a variety of formats are strong drivers to deliver well-designed, efficient and maintainable software and tools. As the frontier of science evolves, new tools constantly need to be written; however, scientists, in particular early-career researchers, might not have received training in software engineering [ 2 ], so their code is at risk of being difficult and costly to maintain and re-use.

To address this gap, we have compiled ten brief software engineering recommendations.

Recommendations

Keep it simple.

Every software project starts somewhere. A rule of thumb is to start as simply as you possibly can. Significantly more problems are created by over-engineering than under-engineering. Simplicity starts with design: a clean and elegant data model is a kind of simplicity that leads naturally to efficient algorithms.

Do the simplest thing that could possibly work, and then double-check it really does work.

Test, test, test

For objectivity, large software development efforts assign different people to test software than those who develop it. This is a luxury not available in most research labs, but there are robust testing strategies available to even the smallest project.

Unit tests are automated software tests that check individual units of code, such as single methods, and are executed on a regular basis. In test-driven development, the tests are written first, serving as a specification and checking every aspect of the intended functionality as it is developed [ 3 ]. Make sure that unit tests cover the full range of possible inputs to each method, including edge cases and invalid values, not only the inputs that seem reasonable.
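As a minimal sketch of what such tests can look like, the following uses Python's standard unittest module. The function gc_content and its behaviour are hypothetical, chosen only for illustration; the point is that the tests cover awkward inputs (empty and invalid sequences) as well as the typical case.

```python
import unittest


def gc_content(sequence: str) -> float:
    """Return the fraction of G and C bases in a DNA sequence (hypothetical example)."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    sequence = sequence.upper()
    invalid = set(sequence) - set("ACGT")
    if invalid:
        raise ValueError(f"unexpected characters: {sorted(invalid)}")
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence)


class GcContentTest(unittest.TestCase):
    def test_typical_sequence(self):
        self.assertAlmostEqual(gc_content("ATGC"), 0.5)

    def test_lowercase_is_accepted(self):
        self.assertAlmostEqual(gc_content("ggcc"), 1.0)

    def test_empty_sequence_is_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("")

    def test_unexpected_characters_are_rejected(self):
        with self.assertRaises(ValueError):
            gc_content("ATXG")
```

Saved in a file, these tests can be run with python -m unittest, for example from a scheduled job or before every commit.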

Do not repeat yourself

Do not be tempted to use the copy-paste-modify coding technique when you encounter similar requirements. Even though this seems to be the simplest approach, it will not remain simple, because important lines of code will end up duplicated. When making changes, you will have to do them twice, taking twice as long, and you may forget an obscure place to which you copied that code, leaving a bug.

Automated tools, such as Simian [ 4 ], can help to detect and fix duplication in existing codebases. To fix duplications or bugs, consider writing a library with methods that can be called when needed.
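A small sketch of the library approach, in Python: the scaling arithmetic lives in one helper, and two otherwise similar processing steps call it rather than each carrying a copy. All function names here are hypothetical and serve only to illustrate the idea.

```python
def normalise(values: list[float]) -> list[float]:
    """Scale values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lowest, highest = min(values), max(values)
    if highest == lowest:
        # All values are identical; avoid dividing by zero.
        return [0.0 for _ in values]
    return [(value - lowest) / (highest - lowest) for value in values]


# Both steps reuse the same helper, so a fix to the scaling logic
# only has to be made (and tested) in one place.
def prepare_intensities(intensities: list[float]) -> list[float]:
    return normalise(intensities)


def prepare_retention_times(retention_times: list[float]) -> list[float]:
    return normalise(retention_times)
```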

Use a modular design

Modules act as building blocks that can be glued together to achieve overall system functionality. They hide the details of their implementation behind a public interface, which provides all the methods that should be used. Users should code – and test – to the interface rather than the implementation [ 5 ]. Thus, concrete implementation details can change without impacting downstream users of the module. Application programming interfaces (APIs) can be shared between different implementation providers.
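One way to express "code to the interface" in Python is with typing.Protocol. The sketch below is illustrative only: SpectrumReader, InMemoryReader and mean_intensity are hypothetical names, not part of any existing library.

```python
from typing import Protocol


class SpectrumReader(Protocol):
    """Public interface: anything that can return peak intensities for a sample."""

    def read_intensities(self, sample_id: str) -> list[float]:
        ...


class InMemoryReader:
    """One possible implementation, backed by a dictionary (handy in tests)."""

    def __init__(self, data: dict[str, list[float]]) -> None:
        self._data = data

    def read_intensities(self, sample_id: str) -> list[float]:
        return list(self._data[sample_id])


def mean_intensity(reader: SpectrumReader, sample_id: str) -> float:
    """Downstream code depends only on the interface, not on a concrete reader."""
    intensities = reader.read_intensities(sample_id)
    return sum(intensities) / len(intensities)
```

A file-backed or database-backed reader could later replace InMemoryReader without any change to mean_intensity, as long as it provides the same read_intensities method.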

Scrutinise modules and libraries that already exist for the functionality you need. Do not rewrite what you can profitably re-use – and do not be put off if the best candidate third-party library contains more functionality than you need (now).

Involve your users

Users know what they need software to do. Let them try the software as early as possible, and make it easy for them to give feedback, via a mailing list or an issue tracker. In an open source software development paradigm, your users can become co-developers. In closed-source and commercial paradigms, you can offer early-access beta releases to a trusted group.

Many sophisticated methods have been developed for user experience analysis. For example, you could hold an interactive workshop [ 6 ].

Resist gold plating

Sometimes, users ask for too much, leading to feature creep or “gold plating”. Learn to tell the difference between essential features and the long list of wishes users may have. Prioritise aggressively with as broad a collection of stakeholders as possible, perhaps using “game-storming” techniques [ 7 ].

Gold plating is a challenge in all phases of development, not only in the early stages of requirements analysis. In its most mischievous disguise, just a little something is added in every iterative project meeting. Those little somethings add up.

Document everything

Comprehensive documentation helps other developers who may take over your code, and will also help you in the future. Use code comments for in-line documentation, especially for any technically challenging blocks, and public interface methods. However, there is no need for comments that mirror the exact detail of code line-by-line.

It is better to have two or three lines of code that are easy to understand than to have one incomprehensible line; for example, see Figure 1.

Figure 1. An example of incomprehensible code: What does this code actually do? It contains a bug; is it easy to spot?

Write clean code [ 8 ] that you would want to maintain long-term (Figure 2). Meaningful, readable variable and method names are a form of documentation.

Figure 2. This code performs the same function, but is written more clearly.

Write an easily accessible module guide for each module, explaining the higher level view: what is the purpose of this module? How does it fit together with other modules? How does one get started using it?
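As a separate illustration (this is not the code from Figures 1 and 2), the Python sketch below contrasts a terse function with a version that behaves the same but documents itself through names, a docstring and one step per line. Both functions and the peak data format are hypothetical.

```python
# Hard to read: terse names and everything packed into one expression.
def f(a, t):
    return [x for x in a if x[1] > t and x[0]][:10]


# Easier to maintain: descriptive names, a docstring, and one step per line.
def top_named_peaks(peaks, min_intensity, limit=10):
    """Return up to `limit` peaks that have a name and exceed `min_intensity`.

    Each peak is a (name, intensity) tuple; input order is preserved.
    """
    named_peaks = [peak for peak in peaks if peak[0]]
    strong_peaks = [peak for peak in named_peaks if peak[1] > min_intensity]
    return strong_peaks[:limit]
```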

Avoid spaghetti

Since GOTO-like commands fell justifiably out of favour several decades ago [ 9 ], you might believe that spaghetti code is a thing of the past. However, a similar phenomenon may be observed in inter-method and inter-module relationships (see Figures 3 and 4). Debugging – stepping through your code as it executes line by line – can help you diagnose modern-day spaghetti code. Beware of module designs where for every unit of functionality you have to step through several different modules to discover where the error is, and along the way you have long lost the record of what the original method was actually doing or what the erroneous input was. The use of effective and granular logging is another way to trace and diagnose problems with the flow through code modules.
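A minimal sketch of granular logging with Python's standard logging module follows. The logger name "biotool.parser" and the record format are assumptions made for illustration; the idea is to record the offending input at the point where a problem is first seen, so it can be traced later without stepping through every module.

```python
import logging

logger = logging.getLogger("biotool.parser")  # hypothetical module name


def parse_record(line: str) -> dict[str, str]:
    """Parse a 'key=value;key=value' record, logging enough context to trace failures."""
    logger.debug("parsing record: %r", line)
    fields = {}
    for part in line.split(";"):
        if "=" not in part:
            # Record the offending input where the problem is first seen.
            logger.warning("skipping malformed field %r in record %r", part, line)
            continue
        key, value = part.split("=", 1)
        fields[key.strip()] = value.strip()
    logger.info("parsed %d fields", len(fields))
    return fields


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG, format="%(name)s %(levelname)s %(message)s")
    parse_record("id=42;name=glucose;broken")
```

Because each module can use its own named logger, the log level can be raised or lowered per module when narrowing down where a problem originates.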

Figure 3. An unhealthy module design for ‘biotool’ with multiple interdependencies between different packages. An addition of functionality to the system (such as supporting a new field) requires updating the software in many different places. Refactoring into a simpler architecture would improve maintainability.

Figure 4. The functional units from the biotool architecture can be grouped together in a refactoring process, putting similar functions together. The result may resemble a Model-View-Controller architecture.

Optimise last

Beware of optimising too early. Although research applications are often performance-critical, until you truly encounter the wide range of inputs that your software will eventually run against in the production environment, it may not be possible to anticipate where the real bottlenecks will lie. Develop the correct functionality first, deploy it and then continuously improve it using repeated evaluation of the system running time as a guide (while your unit tests keep checking that the system is doing what it should).
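Measuring before optimising can be as simple as timing the existing, correct implementation on realistic inputs. The sketch below uses time.perf_counter from the Python standard library; pairwise_distances is a hypothetical, deliberately simple workload, and time_call is an illustrative helper rather than an established API.

```python
import time


def pairwise_distances(points: list[float]) -> list[float]:
    """Deliberately simple O(n^2) implementation (hypothetical workload)."""
    return [abs(a - b) for a in points for b in points]


def time_call(func, *args, repeats=5):
    """Return (result, best wall-clock seconds over `repeats` runs of func(*args))."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


if __name__ == "__main__":
    points = [float(i) for i in range(500)]
    _, seconds = time_call(pairwise_distances, points)
    print(f"pairwise_distances: {seconds:.4f} s for {len(points)} points")
```

Recording such timings for each release, alongside the unit tests, shows whether a proposed optimisation actually helps on the inputs that matter.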

Evolution, not revolution

Maintenance becomes harder as a system gets older. Take time on a regular basis to revisit the codebase, and consider whether it can be renovated and improved [ 10 ]. However, the urge to rewrite an entire system from the beginning should be avoided, unless it is really the only option or the system is very small. Be pragmatic [ 11 ] – you may never finish the rewrite [ 12 ]. This is especially true for systems that were written without following the preceding recommendations.

Use a good version control system (e.g., Git [ 13 ]) and a central repository (e.g., GitHub [ 14 ]). In general, commit early and commit often, and not only when refactoring.

Effective software engineering is a challenge in any enterprise, but may be even more so in the research context. Among other reasons, the research context can encourage a rapid turnover of staff, with the result that knowledge about legacy systems is lost. There can be a shortage of software engineering-specific training, and the “publish or perish” culture may incentivise taking shortcuts.

The recommendations above give a brief introduction to established best practices in software engineering that may serve as a useful reference. Some of these recommendations may be debated in some contexts, but nevertheless are important to understand and master. To learn more, Table 1 lists some additional online and educational resources.

Further reading

This table lists additional online resources where the interested reader can learn more about software engineering best practices in the research context.

Acknowledgements

This commentary is based on a presentation given by JH at a workshop on Software Engineering held at the 2014 annual Metabolomics conference in Tsuruoka, Japan. The authors would like to thank Saravanan Dayalan for organising the workshop and giving JH the opportunity to present. We would furthermore like to thank Robert P. Davey and Chris Mungall for their careful and helpful reviews of an earlier version of this manuscript.

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

JH prepared the initial draft. All authors contributed to, and have read and approved, the final version.

Contributor Information

Janna Hastings, Email: hastings@ebi.ac.uk.

Kenneth Haug, Email: kenneth@ebi.ac.uk.

Christoph Steinbeck, Email: steinbeck@ebi.ac.uk.

Research Topics

Research in Systems Engineering at Cornell covers an extremely broad range of topics. Because of the nature of systems science and engineering, the research takes a collaborative approach, with faculty and students from many different disciplines, both in traditional engineering areas and in fields outside of engineering such as health care, food systems, environmental studies, architecture and regional planning, and many others.

  • Artificial Intelligence
  • Computational Science and Engineering
  • Computer Systems
  • Data Mining
  • Earth and Atmospheric Science
  • Energy Systems
  • Health Systems
  • Heat and Mass Transfer
  • Information Theory and Communication
  • Infrastructure Systems
  • Mechanics biological materials
  • Natural Hazards
  • Programming Languages - CS
  • Remote Sensing
  • Robotics and Autonomy
  • Satellite Systems
  • Scientific Computing
  • Sensor and Actuators
  • Signal and Image Processing
  • Space Science and Engineering
  • Statistics and Machine Learning
  • Statistical Mechanics and Molecular Simulation
  • Sustainable Energy Systems
  • Systems and Networking - CS
  • Transportation Systems Engineering
  • Water Systems

Algorithms

Oliver Gao | Civil and Environmental Engineering

David Goldberg | Operations Research and Information Engineering

Adrian Lewis | Operations Research and Information Engineering

Linda Nozick | Civil and Environmental Engineering

Francesca Parise | Electrical and Computer Engineering

Mason Peck | Mechanical and Aerospace Engineering

Patrick Reed | Civil and Environmental Engineering

Samitha Samaranayake | Civil and Environmental Engineering

Timothy Sands | Mechanical and Aerospace Engineering

Huseyin Topaloglu | Operations Research and Information Engineering

Fengqi You | Chemical and Biomolecular Engineering

infrastructure

Mark Campbell | Mechanical and Aerospace Engineering

Kirstin Petersen | Electrical and Computer Engineering

Patrick Reed | Civil and Environmental Engineering

Computational Science and Engineering

Jose Martinez | Electrical and Computer Engineering

Data Science

Madeleine Udell | Operations Research and Information Engineering

Earth and atmospheric science

Maha Haji | Mechanical and Aerospace Engineering

Semida Silveira | Systems Engineering

Jery Stedinger | Civil and Environmental Engineering

Jefferson Tester | Chemical and Biomolecular Engineering

Lang Tong | Electrical and Computer Engineering

Fengqi You | Chemical and Biomolecular Engineering

Health systems

Shane Henderson | Operations Research and Information Engineering

John Muckstadt | Operations Research and Information Engineering

Jamol Pender | Operations Research and Information Engineering

Rana Zadeh | Human Centered Design

Yiye Zhang | Weill Cornell Medicine

Heat and mass transfer

Information Theory and Communications

Stephen Wicker | Electrical and Computer Engineering

Infrastructure Systems

Programming Languages – CS

Andrew Myers | Computer Science

Fred Schneider | Computer Science

Remote Sensing

Mason Peck | Mechanical and Aerospace Engineering

Robotics

Mark Campbell | Mechanical and Aerospace Engineering

Robert Shepherd | Mechanical and Aerospace Engineering

Satellite systems

Ricardo Daziano | Civil and Environmental Engineering

Linda Nozick | Civil and Environmental Engineering

Bart Selman | Computer Science

Statistical Mechanics and Molecular Simulation

Timur Dogan | Arts Architecture and Planning

Systems and Networking - CS

Ken Birman | Computer Science

Hakim Weatherspoon | Computer Science

Transportation Systems Engineering

Richard Geddes | College of Human Ecology

Water systems

Software Engineer Trends in 2024

Top software engineer tech skills, hottest roles, highest paying markets, and more

Each year we produce research on Software Engineer trends in the marketplace. This year, we tossed the 50+ page gated report out the window. Instead, we’re publishing articles covering software engineer trends in an easier-to-read and share format. We’re starting with trends in software engineer and developer tech skills. Next, we’ll explore the shift in software engineer specializations, or subroles, as they’re called on the Hired tech recruitment platform.

AI and GenAI are big, broad topics, but we’re diving into them. We’ll ask subject matter experts about AI’s impact on software engineering roles within an organization and on functions and processes. What do employers need to know for their tech recruiting and how can software engineers leverage it to be valued employees and competitive job candidates? We’ll find out.

Next, we’ll reveal new data on software engineer salaries. We’ll break it down by software engineer specialization, which tech skills garner higher compensation, and how salaries vary based on location. Then, we’ll examine which tech job markets are most attractive to employers and jobseekers. We’ll also share research on workplace models and compare which are favored by employers versus software engineers.

Join us, as we explore Software Engineer trends in 2024.

Will AI Replace Programmers? New Expert Insights

Impacts of AI/GenAI on the software engineer job market and profession

AI is a deep pool, but we’re diving in. Get new data and hear from CTOs on the impact of AI on the software engineering industry and how to build effective teams in 2024.

Related Reports

Trends in Software Engineer Specializations: 2024 Report

Roles employers want to fill and software engineering specializations most likely to secure an interview.

New data showing trends in software engineer roles and emerging specializations combined with expert insights.

Trends in Software Engineer Tech Skills: 2024 Report

What employers are looking for & which tech skills garner more interview requests.

Which specific tech skills are in high demand? New specializations are emerging, and some tech experts tell us AI isn’t as scary as some media outlets want you to believe.  

More research on Software Engineer trends

The Future of Tech Hiring: 8 Bold Predictions for 2024

State of Wage Inequality in the Tech Industry 2023 Research Study

Hired’s 2023 State of UK Tech Salaries

Methodology, FAQs, and more

Where did this data come from?

Methodology

This report is based on proprietary data gathered and analyzed by Hired’s data science teams. For these report articles, Hired examined software engineering candidate interview requests (IVR) and salary data from January 2022 through December 2023 inclusive. The data included reflects over 65,000 candidates and 330,000 interview requests between companies and software engineers on Hired during this time period. A minimum of 500 interview requests, in a given market, were required for salary-related data to be valid and included in the report. 
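The 500-request threshold described above can be pictured as a simple filter applied before any salary figures are computed. The sketch below is only an illustration: the `(market, salary)` record shape, the function name, and the sample market names are hypothetical and do not reflect Hired's actual data schema or pipeline.

```python
from collections import Counter

# Threshold stated in the methodology: a market needs at least this many
# interview requests before its salary data is considered valid.
MIN_IVRS_PER_MARKET = 500


def markets_with_enough_data(interview_requests, minimum=MIN_IVRS_PER_MARKET):
    """Return the set of markets whose interview-request count meets the minimum.

    `interview_requests` is an iterable of (market, salary) pairs. This record
    shape is a hypothetical stand-in, not Hired's real schema.
    """
    counts = Counter(market for market, _salary in interview_requests)
    return {market for market, n in counts.items() if n >= minimum}


# Made-up sample: one market exactly at the threshold, one just below it.
sample = [("Market A", 150_000)] * 500 + [("Market B", 140_000)] * 499
eligible = markets_with_enough_data(sample)
```

With this sample, only "Market A" would remain eligible for salary reporting, because "Market B" falls one request short of the minimum.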

Of note, for the sections on the demand for certain software engineering roles and skills, candidates can have multiple subroles and several skills (Java, C, C++, etc.) associated with their candidate profile (e.g., a candidate with a primary role of software engineer can have both NLP engineer and machine learning engineer subroles on their profile). Positions can also have multiple subroles. All salaries indicated reflect the employer’s salary figure at the time of the interview request.

Employer sizes are denoted as follows: 

eSMB = 1-75 employees, SMB = 76-300, MM = 301-1000, ENT = 1001+
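As a minimal sketch, the size bands above map directly onto a small lookup function. The function name is hypothetical; only the labels and boundaries come from the report.

```python
def employer_size_bucket(employees: int) -> str:
    """Map a head count to the report's employer-size label.

    Boundaries follow the report: eSMB = 1-75, SMB = 76-300,
    MM = 301-1000, ENT = 1001+.
    """
    if employees < 1:
        raise ValueError("an employer must have at least one employee")
    if employees <= 75:
        return "eSMB"
    if employees <= 300:
        return "SMB"
    if employees <= 1000:
        return "MM"
    return "ENT"
```

For example, `employer_size_bucket(76)` falls in the SMB band, while `employer_size_bucket(1001)` is the first head count classed as ENT.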

What is Hired.com?

Hired is a two-sided marketplace that helps high-quality, active jobseekers in tech and sales find new roles while helping employers find the right candidates efficiently. Check out the blog What is Hired.com and About Hired for more.

Looking back: Big Transitions in the Tech Industry: Hired's 2023 State of Software Engineers report

Overview from Hired’s CEO

As we reflected on 2022 and the data, so many images came to mind – a rollercoaster, a pendulum, a weathervane, a two-sided coin. It’s been a challenging time for the tech industry. Despite this, the tech industry unemployment rate continues to improve, dropping from 1.8% in December to 1.5% in January. This rate suggests many laid-off workers were quickly reabsorbed into the workforce and that several of the layoffs in the tech sector are non-technical workers, such as sales, marketing, or support roles.

We believe virtually every company, at this point, is a tech company. You may be in healthcare, manufacturing, or hospitality, but as part of this global economy, you have a tech team. We’re faced with a lot of big transitions, but we know software engineers are resilient, adaptable and creative problem-solvers. Software engineering is an incredible career choice, providing opportunities to touch a multitude of industries and solve big issues.

In fact, U.S. News & World Report recently named it #1 in their list of 100 Top Jobs. We create this report every year to help talent professionals and software engineers understand the hiring climate, as well as what’s top of mind for employers and engineers. It’s part of our vision to make hiring more equitable, efficient, and transparent for all. We’re here to support both employers and jobseekers every step of the way.

Hired's 2023 UK State of Software Engineers

It’s been a challenging year for organisations across the United Kingdom. 2022 began with the struggle to find and hire talent. As the year went on, the energy crisis worsened, national leadership saw significant turnover, and economic policies shifted. Many companies pulled back on scaling teams and even laid off workers they’d hired only months before.

Despite these obstacles, signs indicate global remote hiring remains confident in Europe, with employers especially interested in experienced talent with specialised skill sets. At Hired, we aspire to make hiring as efficient, equitable, and transparent as possible, for the best experience for all parties. Whether backfilling roles or advancing initiatives, we help companies of all sizes fill tech and sales roles with unbiased insights, diversity, equity & inclusion (DEI) tools, skills assessments, and dedicated Customer Success managers.

In this report, we’ve specifically studied software engineers on the Hired marketplace to identify the top issues and trends. This includes looking at the most in-demand skills and roles, salaries, and shifts to remote hiring. We provide more details in the Methodology section, but this is based on more than 76K interview requests (IVRs) between employers and (more than 9K) candidates on the Hired marketplace.

We’ve supplemented the platform data with survey responses from more than 1,300 software engineers and 120 talent professionals and hiring managers. We aim to provide organisations with takeaways to better attract and retain talent and insights to help software engineers succeed in their careers.

Hired's 2022 State of Software Engineers

Amid the ongoing tech talent shortage and record-high demands from companies eager to fill open roles, software engineers on Hired received more than twice the amount of interview requests on average in 2021 than they did in 2020. This competitive hiring market continues to put pressure on companies to offer compelling salaries and benefits and extend their talent search to hire remote software engineers outside of big tech hubs, expanding and distributing teams globally. For software engineers, upskilling is key to thrive in this global job market and the more specialized their skill set, the higher the demand and salary.

Josh Brenner, CEO of Hired

Employers: How does Hired support tech recruiting and hiring?

If you’re new to Hired, check out our Employers page explaining exactly this, featuring FAQs, customer success stories, and more. Hired offers multiple products and services (DEI features, coding challenge campaigns, talent sourcing, etc.) along with solutions for hiring managers, talent acquisition teams, DEI leaders, employer branding, and enterprise hiring.

Jobseekers: When should I sign up for Hired? Is it free?

You may create your Hired profile at any time. Please specify during the sign up process when you would like your profile to become visible to the employers on our platform. Employers on Hired are looking to fill their open positions as soon as possible, so select a date for your profile to be visible to the employers on Hired that falls about one month before your preferred job start date. Learn more in this blog, How to Get Approved on Hired.

Is it free for talent? Yes! There is no charge for jobseekers from start to finish. Our services are paid for by companies who value the ability to source qualified and engaged talent efficiently. Learn more about how Hired works for jobseekers.

AI system used to improve Nashville public transit takes top honors at international research conference

Lucas Johnson

Jun 3, 2024, 3:09 PM

A software system developed by Vanderbilt researchers to help improve operations of Nashville’s public transportation network won “Best Paper” at the 15th ACM/IEEE International Conference on Cyber-Physical Systems (ICCPS), held in Hong Kong May 13-16.

Currently, WeGo Nashville has 160 buses that cover 35 routes. However, its ridership is steadily increasing, leading to higher chances of disruptions like overcrowding, mechanical failures, and accidents.

To address the problems, Abhishek Dubey, associate professor of computer science and electrical and computer engineering, and his team designed a cloud-based tool called Vectura that incorporates artificial intelligence to provide insights into passenger flows, service delays, and the operational efficiency of WeGo service, which is testing the technology.

“Vectura leverages the data feeds already generated by modern transit networks, transforming raw figures into intuitive, interactive visualizations that highlight key performance metrics,” said Dubey, senior research scientist at the Institute for Software Integrated Systems (ISIS). “By doing so, it enables operators to quickly identify trends, pinpoint inefficiencies, and make informed decisions to enhance service quality.”

In the paper, the team also described using decision-making processes where the focus extends beyond immediate outcomes to consider long-term consequences of actions — an approach known as “non-myopic sequential decision procedures.”

Dubey explained that by doing so, the software can help anticipate problems and proactively position buses near areas with high likelihoods of disruptions and determine which vehicle to dispatch to solve a particular problem. Results showed 2% more passengers served and a reduction in deadhead miles by 40%.

“You don’t know what will happen in the future, but you plan for it by using some idea of what the environment is,” said Jose Talusan, lead author on the paper and a research scientist in ISIS. “Once a decision is performed, you wait and see how the environment reacts and then consider this for your next round of decisions.”
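The difference between a myopic and a non-myopic decision procedure can be illustrated with a deliberately tiny toy model. Everything in the sketch below is hypothetical (the two states, the rewards, the function names) and it is deterministic for simplicity; it is not the algorithm from the paper or Vectura's implementation. A spare bus can idle at the depot for a small payoff or spend one unrewarded step repositioning to a hotspot where later payoffs are larger. The planner looks ahead a fixed number of steps, takes one action, observes the new state, and then re-plans.

```python
# Toy model: TRANSITIONS[state][action] = (immediate_reward, next_state).
# States, actions, and rewards are made up purely for illustration.
TRANSITIONS = {
    "depot": {"stay": (1, "depot"), "reposition": (0, "hotspot")},
    "hotspot": {"stay": (5, "hotspot"), "return": (0, "depot")},
}


def best_value(state, horizon):
    """Best total reward achievable from `state` within `horizon` steps."""
    if horizon == 0:
        return 0
    return max(
        reward + best_value(next_state, horizon - 1)
        for reward, next_state in TRANSITIONS[state].values()
    )


def choose_action(state, horizon):
    """Pick the action with the best reward over the next `horizon` steps.

    horizon=1 is the myopic case: only the immediate reward is considered.
    """
    def score(action):
        reward, next_state = TRANSITIONS[state][action]
        return reward + best_value(next_state, horizon - 1)

    return max(TRANSITIONS[state], key=score)


def run(state, steps, horizon):
    """Act for `steps` rounds, re-planning after every observed transition."""
    total = 0
    for _ in range(steps):
        action = choose_action(state, horizon)
        reward, state = TRANSITIONS[state][action]
        total += reward
    return total


myopic_total = run("depot", steps=4, horizon=1)
lookahead_total = run("depot", steps=4, horizon=3)
```

In this toy setting the myopic planner never leaves the depot and collects 4 in total, while the three-step lookahead accepts one step with zero reward to reposition and collects 15. A real transit setting is stochastic, so a planner would work with expected outcomes rather than the fixed rewards assumed here.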

Dan Freudberg, WeGo’s deputy chief operating officer, is a co-author on the paper.

Funding for the research was provided through the Federal Transit Administration and National Science Foundation.

Contact: Lucas Johnson,  [email protected]
