codefile
is a Go library for detecting the programming language of a given file. It uses content-based detection with weighted keyword matching, ensuring robust and accurate identification, even for files without extensions.
It uses TOFU
- Content-Based Detection:
- Detects programming languages by inspecting file content for unique constructs and patterns.
- Weighted Scoring:
- Each language feature is assigned a weight to improve detection accuracy.
- Efficient Scanning:
- Only inspects the first 20 lines of a file for optimal performance.
Install the package using go get
:
go get github.com/Agent-Hellboy/codefile
Basic Language Detection Detect the programming language of a file:
package main
import (
"fmt"
"github.com/Agent-Hellboy/codefile"
)
func main() {
filePath := "example.py"
language, ok := codefile.DetectCodeFileType(filePath)
if ok {
fmt.Printf("The language of the file is: %s\n", language)
} else {
fmt.Println("Language could not be detected.")
}
}
The language of the file is: Go
The library supports the following programming languages out of the box:
- Python
- Go
- C++
- Java
- JavaScript
- TypeScript
- Shell
- Steps of TOFU Algorithm
1 Tokenization:
Parse the file content and split it into meaningful tokens (e.g., keywords, operators, literals). Consider language-specific symbols like ;, {}, and ().
2 Frequency Analysis:
Count the occurrences of each token in the file. Use this frequency to weigh the probability of a match for each programming language.
3 Weighted Matching:
Compare the token distribution with predefined language profiles. Each profile contains common keywords, operators, and constructs with associated weights for a language.
4 Confidence Scoring:
Compute a confidence score for each language based on: Token frequency. Unique constructs (e.g., package main for Go, #include for C++). Weighted patterns.
5 Threshold Comparison:
If the highest confidence score exceeds a predefined threshold, classify the file as that language. If no score exceeds the threshold, classify the language as "Unknown."