Understanding Code Scanning and Code Matching
Open source code scanning and matching tools and techniques fall under the more general category of code analysis tools. Code analysis tools can operate either on running code or on code bases. Dynamic code analysis tools, such as memory and performance profilers, collect data while the code is running. Static analysis tools collect data by scanning the source and binary files used to build the software. Static analysis tools can be used to identify poor programming practices, expose design issues, and uncover vulnerabilities. More recently, static analysis techniques have been used to discover open source code.
Scanning code for open source provides a transparent view of an organization's code assets, identifying what outside code is being used within specific applications. Making open source code "visible" in this way helps organizations optimize its use: they can standardize on common components, reduce duplicate code, facilitate re-use, and lower maintenance costs. And since all open source code comes with a license, many users are also interested in understanding the legal obligations, making license detection another important part of open source code scanning.
Multiple techniques are often required to do an effective job of identifying code matches. These techniques include:
- Code "printing," where cryptographic hash values of code are computed and compared (e.g. MD5, SHA-1)
- String searches, where different techniques are used to search for matching strings, or patterns, of code
- Dependency analysis, in which pointers to code not included in the scanned files are discovered
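
The three techniques above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not how any particular scanning product works: the function names (`fingerprint`, `find_licenses`, `find_imports`, `scan`), the three license patterns, and the `known_hashes` index are all hypothetical, and the dependency step only understands Python `import` statements.

```python
import ast
import hashlib
import re

# Illustrative license patterns only (an assumption for this sketch);
# real scanners rely on much larger rule sets.
LICENSE_PATTERNS = {
    "MIT": re.compile(r"Permission is hereby granted, free of charge", re.I),
    "GPL": re.compile(r"GNU General Public License", re.I),
    "Apache-2.0": re.compile(r"Licensed under the Apache License, Version 2\.0", re.I),
}


def fingerprint(source: str) -> str:
    """Code 'printing': SHA-1 digest of the text, ignoring line endings
    and trailing whitespace so trivial reformatting does not change it."""
    normalized = "\n".join(line.rstrip() for line in source.splitlines())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def find_licenses(source: str) -> list[str]:
    """String search: names of the license patterns found in the text."""
    return sorted(name for name, pat in LICENSE_PATTERNS.items() if pat.search(source))


def find_imports(source: str) -> list[str]:
    """Dependency analysis: top-level modules that Python source refers to
    but that are not necessarily part of the scanned files."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            modules.add(node.module.split(".")[0])
    return sorted(modules)


def scan(source: str, known_hashes: dict[str, str]) -> dict:
    """Combine the three techniques into one report for a single file.

    known_hashes is a hypothetical index mapping digests of known
    open source files to a label describing their origin."""
    return {
        "exact_match": known_hashes.get(fingerprint(source)),
        "licenses": find_licenses(source),
        "imports": find_imports(source),
    }
```

A whole-file hash only recognizes copies that are identical after normalization, so a file that has been edited, or a snippet pasted into a larger file, will not match. That is one reason the string and dependency techniques are used alongside hashing rather than instead of it.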
