diff --git a/CONFIGURATION.md b/CONFIGURATION.md new file mode 100644 index 0000000..196acfc --- /dev/null +++ b/CONFIGURATION.md @@ -0,0 +1,18 @@ +# configuration +the `ext-cfg.yaml` file allows you to set custom options for each extractor. this is useful for advanced configuration of the bot, mostly related to network settings. +> [!NOTE] +> this configuration will override the global configuration. this is useful in case you want to set a global proxy in the `.env` file and then override it for specific extractors in the `ext-cfg.yaml` file. + +## structure +the file uses yaml format. each top-level key is the name of an extractor. under each extractor, you can define options supported by that extractor, for example: +```yaml +instagram: + edge_proxy_url: https://example.com + impersonate: true +``` + +## available options +* `http_proxy` | `https_proxy`: the http(s) proxy to use for this extractor. see [proxying](README.md#proxying) for more information. +* `no_proxy`: the domains that should not be proxied for this extractor. +* `edge_proxy_url`: the url of the edge proxy to use for this extractor. see [edge proxy](EDGEPROXY.md) for more information. +* `impersonate`: whether to impersonate chrome. this is useful for extractors that require specific browsers' fingerprints to work. \ No newline at end of file diff --git a/EDGEPROXY.md b/EDGEPROXY.md new file mode 100644 index 0000000..7514dd9 --- /dev/null +++ b/EDGEPROXY.md @@ -0,0 +1,41 @@ +# edge proxy +edge proxy is an optional feature that allows routing some extractor requests through a custom proxy endpoint, instead of a classic http/https proxy. this is useful if you want to centralize or control the traffic of certain platforms via your own proxy service, for example to bypass geo-restrictions, add caching, logging, or other customizations. + +## configuration +edge proxy is configured via the `ext-cfg.yaml` file. +you can set the proxy url for each extractor that supports it. 
+example: + +```yaml +instagram_share: + edge_proxy_url: https://example.com + +reddit: + https_proxy: https://example.com +``` + +## response format +the edge proxy must respond with a JSON object in the following format (see [`models.EdgeProxyResponse`](models/edgeproxy.go)). + +```json +{ + "url": "https://example.com/resource", + "status_code": 200, + "text": "response body", + "headers": { + "Content-Type": "application/json" + }, + "cookies": [ + "cookie1=value1; Path=/; HttpOnly", + "cookie2=value2; Path=/" + ] +} +``` + +## http proxy vs edge proxy +the main difference between http proxy and edge proxy is that http proxy is a standard proxy that forwards requests and responses, while edge proxy is a custom proxy that can modify the requests and responses in any way you want. + +## notes +* edge proxy is for advanced use and not required for most users. +* this feature is experimental and may change in the future. +* you can check the full implementation of the edge proxy in [`util/edgeproxy.go`](util/edgeproxy.go). \ No newline at end of file diff --git a/README.md b/README.md index f50d06b..a04013c 100644 --- a/README.md +++ b/README.md @@ -1,7 +1,7 @@ # govd -a telegram bot for downloading media from various platforms +a telegram bot for downloading media from various platforms. -this project draws significant inspiration from [yt-dlp](https://github.com/yt-dlp/yt-dlp) +this project draws significant inspiration from [yt-dlp](https://github.com/yt-dlp/yt-dlp). 
- official instance: [@govd_bot](https://t.me/govd_bot) - support group: [govdsupport](https://t.me/govdsupport) @@ -12,7 +12,7 @@ this project draws significant inspiration from [yt-dlp](https://github.com/yt-d * [installation](#installation) * [build](#build) * [docker](#docker-recommended) -* [options](#options) +* [configuration](#configuration) * [authentication](#authentication) * [proxying](#proxying) * [todo](#todo) @@ -30,7 +30,7 @@ this project draws significant inspiration from [yt-dlp](https://github.com/yt-d > [!NOTE] > there's no official support for windows yet. if you want to run the bot on it, please follow [docker installation](#docker-recommended). -1. clone the repository +1. clone the repository: ```bash git clone https://github.com/govdbot/govd.git && cd govd ``` @@ -68,7 +68,9 @@ this project draws significant inspiration from [yt-dlp](https://github.com/yt-d docker compose up -d ``` -# options +# configuration +you can configure the bot using the `.env` file. here are the available options: + | variable | description | default | |-------------------------------|----------------------------------------------|---------------------------------------| | DB_HOST | database host | localhost | @@ -87,15 +89,16 @@ this project draws significant inspiration from [yt-dlp](https://github.com/yt-d | REPO_URL | project repository url | https://github.com/govdbot/govd | | PROFILER_PORT | port for profiler http server (pprof) | 0 _(disabled)_ | -you can configure specific extractors options with `ext-cfg.yaml` file. documentation is not available yet, but you can check the source code for more information. +you can configure specific extractors options with `ext-cfg.yaml` file ([learn more](CONFIGURATION.md)). > [!IMPORTANT] > to avoid limits on files, you should host your own telegram botapi and set `BOT_API_URL` variable according. 
public bot instance is currently running under a botapi fork, [tdlight-telegram-bot-api](https://github.com/tdlight-team/tdlight-telegram-bot-api), but you can use the official botapi client too. # proxying -there are two types of proxying available: http and edge. +there are two types of proxying available: * **http proxy**: this is a standard http proxy that can be used to route requests through a proxy server. you can set the `HTTP_PROXY` and `HTTPS_PROXY` environment variables to use this feature. (SOCKS5 is supported too) -* **edge proxy**: this is a custom proxy that is used to route requests through a specific url. currenrly, you can only set this proxy with `ext-cfg.yaml` file. this is useful for routing requests through a specific server or service. however, this feature is not totally implemented yet. +* **edge proxy**: this is a custom proxy that is used to route requests through a specific url. currently, you can only set this proxy with the `ext-cfg.yaml` file ([learn more](EDGEPROXY.md)). + > [!TIP] > by settings `NO_PROXY` environment variable, you can specify domains that should not be proxied. 
diff --git a/models/edgeproxy.go b/models/edgeproxy.go index 6f139df..c1900ba 100644 --- a/models/edgeproxy.go +++ b/models/edgeproxy.go @@ -1,6 +1,6 @@ package models -type ProxyResponse struct { +type EdgeProxyResponse struct { URL string `json:"url"` StatusCode int `json:"status_code"` Text string `json:"text"` diff --git a/models/ext.go b/models/ext.go index 44bfe55..a35a151 100644 --- a/models/ext.go +++ b/models/ext.go @@ -39,4 +39,5 @@ type ExtractorConfig struct { HTTPSProxy string `yaml:"https_proxy"` NoProxy string `yaml:"no_proxy"` EdgeProxyURL string `yaml:"edge_proxy_url"` + Impersonate bool `yaml:"impersonate"` } diff --git a/util/edgeproxy.go b/util/edgeproxy.go new file mode 100644 index 0000000..5a37695 --- /dev/null +++ b/util/edgeproxy.go @@ -0,0 +1,141 @@ +package util + +import ( + "bytes" + "fmt" + "govd/models" + "io" + "net/http" + "net/url" + "strconv" + "time" + + "github.com/bytedance/sonic" +) + +type EdgeProxyClient struct { + client *http.Client + proxyURL string +} + +func NewEdgeProxyFromConfig(cfg *models.ExtractorConfig) *EdgeProxyClient { + var baseClient *http.Client + if cfg.Impersonate { + baseClient = NewChromeClient() + } else { + baseClient = &http.Client{ + Transport: GetBaseTransport(), + Timeout: 60 * time.Second, + } + } + return &EdgeProxyClient{ + client: baseClient, + proxyURL: cfg.EdgeProxyURL, + } +} + +func NewEdgeProxy( + proxyURL string, +) *EdgeProxyClient { + return &EdgeProxyClient{ + client: &http.Client{ + Transport: GetBaseTransport(), + Timeout: 60 * time.Second, + }, + proxyURL: proxyURL, + } +} + +func (c *EdgeProxyClient) Do(req *http.Request) (*http.Response, error) { + if c.proxyURL == "" { + return nil, fmt.Errorf("proxy URL is not set") + } + + targetURL := req.URL.String() + encodedURL := url.QueryEscape(targetURL) + proxyURLWithParam := c.proxyURL + "?url=" + encodedURL + + bodyBytes, err := readRequestBody(req) + if err != nil { + return nil, err + } + + proxyReq, err := http.NewRequest( + 
req.Method, + proxyURLWithParam, + bytes.NewBuffer(bodyBytes), + ) + if err != nil { + return nil, fmt.Errorf("error creating proxy request: %w", err) + } + + copyHeaders(req.Header, proxyReq.Header) + + proxyResp, err := c.client.Do(proxyReq) + if err != nil { + return nil, fmt.Errorf("proxy request failed: %w", err) + } + defer proxyResp.Body.Close() + + return parseProxyResponse(proxyResp, req) +} + +func readRequestBody(req *http.Request) ([]byte, error) { + if req.Body == nil { + return nil, nil + } + + bodyBytes, err := io.ReadAll(req.Body) + if err != nil { + return nil, fmt.Errorf("error reading request body: %w", err) + } + + req.Body.Close() + req.Body = io.NopCloser(bytes.NewBuffer(bodyBytes)) + + return bodyBytes, nil +} + +func copyHeaders(source, destination http.Header) { + for name, values := range source { + for _, value := range values { + destination.Add(name, value) + } + } +} + +func parseProxyResponse(proxyResp *http.Response, originalReq *http.Request) (*http.Response, error) { + body, err := io.ReadAll(proxyResp.Body) + if err != nil { + return nil, fmt.Errorf("error reading proxy response: %w", err) + } + + var response models.EdgeProxyResponse + if err := sonic.ConfigFastest.Unmarshal(body, &response); err != nil { + return nil, fmt.Errorf("error parsing proxy response: %w", err) + } + + resp := &http.Response{ + StatusCode: response.StatusCode, + Status: strconv.Itoa(response.StatusCode) + " " + http.StatusText(response.StatusCode), + Body: io.NopCloser(bytes.NewBufferString(response.Text)), + Header: make(http.Header), + Request: originalReq, + } + + parsedResponseURL, err := url.Parse(response.URL) + if err != nil { + return nil, fmt.Errorf("error parsing response URL: %w", err) + } + resp.Request.URL = parsedResponseURL + + for name, value := range response.Headers { + resp.Header.Set(name, value) + } + + for _, cookie := range response.Cookies { + resp.Header.Add("Set-Cookie", cookie) + } + + return resp, nil +} diff --git 
a/util/fingerprint.go b/util/fingerprint.go index 5198842..c059ccf 100644 --- a/util/fingerprint.go +++ b/util/fingerprint.go @@ -3,6 +3,7 @@ package util import ( "crypto/tls" "net/http" + "time" ) func ChromeClientHelloSpec() *tls.ClientHelloInfo { @@ -72,13 +73,12 @@ func NewChromeClient() *http.Client { Renegotiation: tls.RenegotiateNever, } - transport := &http.Transport{ - TLSClientConfig: tlsConfig, - // chrome enables HTTP/2 - ForceAttemptHTTP2: true, - } + transport := GetBaseTransport() + transport.TLSClientConfig = tlsConfig + // chrome uses HTTP/2, but it's enabled by default in base transport return &http.Client{ Transport: transport, + Timeout: 60 * time.Second, } } diff --git a/util/http.go b/util/http.go index b6a9f27..2c93078 100644 --- a/util/http.go +++ b/util/http.go @@ -1,11 +1,8 @@ package util import ( - "bytes" - "fmt" "govd/config" "govd/models" - "io" "log" "net" "net/http" @@ -13,8 +10,6 @@ import ( "strings" "sync" "time" - - "github.com/bytedance/sonic" ) var ( @@ -26,14 +21,14 @@ var ( func GetDefaultHTTPClient() *http.Client { defaultClientOnce.Do(func() { defaultClient = &http.Client{ - Transport: createBaseTransport(), + Transport: GetBaseTransport(), Timeout: 60 * time.Second, } }) return defaultClient } -func createBaseTransport() *http.Transport { +func GetBaseTransport() *http.Transport { return &http.Transport{ Proxy: http.ProxyFromEnvironment, DialContext: (&net.Dialer{ @@ -65,26 +60,27 @@ func GetHTTPClient(extractor string) models.HTTPClient { var client models.HTTPClient if cfg.EdgeProxyURL != "" { - client = NewEdgeProxyClient(cfg.EdgeProxyURL) + client = NewEdgeProxyFromConfig(cfg) } else { - client = createClientWithProxy(cfg) + client = NewClientFromConfig(cfg) } - extractorClients[extractor] = client return client } -func createClientWithProxy(cfg *models.ExtractorConfig) *http.Client { - transport := createBaseTransport() - +func NewClientFromConfig(cfg *models.ExtractorConfig) *http.Client { + var baseClient 
*http.Client + if cfg.Impersonate { + baseClient = NewChromeClient() + } else { + baseClient = GetDefaultHTTPClient() + } + transport := GetBaseTransport() if cfg.HTTPProxy != "" || cfg.HTTPSProxy != "" { configureProxyTransport(transport, cfg) } - - return &http.Client{ - Transport: transport, - Timeout: 60 * time.Second, - } + baseClient.Transport = transport + return baseClient } func configureProxyTransport( @@ -100,20 +96,16 @@ func configureProxyTransport( log.Printf("warning: invalid HTTP proxy URL '%s': %v\n", cfg.HTTPProxy, err) } } - if cfg.HTTPSProxy != "" { httpsProxyURL, err = url.Parse(cfg.HTTPSProxy) if err != nil { log.Printf("warning: invalid HTTPS proxy URL '%s': %v\n", cfg.HTTPSProxy, err) } } - if httpProxyURL == nil && httpsProxyURL == nil { return } - noProxyList := parseNoProxyList(cfg.NoProxy) - transport.Proxy = func(req *http.Request) (*url.URL, error) { if shouldBypassProxy(req.URL.Hostname(), noProxyList) { return nil, nil @@ -155,112 +147,3 @@ func shouldBypassProxy(host string, noProxyList []string) bool { } return false } - -type EdgeProxyClient struct { - client *http.Client - proxyURL string -} - -func NewEdgeProxyClient(proxyURL string) *EdgeProxyClient { - return &EdgeProxyClient{ - client: &http.Client{ - Transport: createBaseTransport(), - Timeout: 60 * time.Second, - }, - proxyURL: proxyURL, - } -} - -func (c *EdgeProxyClient) Do(req *http.Request) (*http.Response, error) { - if c.proxyURL == "" { - return nil, fmt.Errorf("proxy URL is not set") - } - - targetURL := req.URL.String() - encodedURL := url.QueryEscape(targetURL) - proxyURLWithParam := c.proxyURL + "?url=" + encodedURL - - bodyBytes, err := readRequestBody(req) - if err != nil { - return nil, err - } - - proxyReq, err := http.NewRequest( - req.Method, - proxyURLWithParam, - bytes.NewBuffer(bodyBytes), - ) - if err != nil { - return nil, fmt.Errorf("error creating proxy request: %w", err) - } - - copyHeaders(req.Header, proxyReq.Header) - - proxyResp, err := 
c.client.Do(proxyReq) - if err != nil { - return nil, fmt.Errorf("proxy request failed: %w", err) - } - defer proxyResp.Body.Close() - - return parseProxyResponse(proxyResp, req) -} - -func readRequestBody(req *http.Request) ([]byte, error) { - if req.Body == nil { - return nil, nil - } - - bodyBytes, err := io.ReadAll(req.Body) - if err != nil { - return nil, fmt.Errorf("error reading request body: %w", err) - } - - req.Body.Close() - req.Body = io.NopCloser(bytes.NewBuffer(bodyBytes)) - - return bodyBytes, nil -} - -func copyHeaders(source, destination http.Header) { - for name, values := range source { - for _, value := range values { - destination.Add(name, value) - } - } -} - -func parseProxyResponse(proxyResp *http.Response, originalReq *http.Request) (*http.Response, error) { - body, err := io.ReadAll(proxyResp.Body) - if err != nil { - return nil, fmt.Errorf("error reading proxy response: %w", err) - } - - var response models.ProxyResponse - if err := sonic.ConfigFastest.Unmarshal(body, &response); err != nil { - return nil, fmt.Errorf("error parsing proxy response: %w", err) - } - - resp := &http.Response{ - StatusCode: response.StatusCode, - Status: fmt.Sprintf("%d %s", response.StatusCode, http.StatusText(response.StatusCode)), - Body: io.NopCloser(bytes.NewBufferString(response.Text)), - Header: make(http.Header), - Request: originalReq, - } - - parsedResponseURL, err := url.Parse(response.URL) - if err != nil { - return nil, fmt.Errorf("error parsing response URL: %w", err) - } - resp.Request.URL = parsedResponseURL - - for name, value := range response.Headers { - resp.Header.Set(name, value) - } - - for _, cookie := range response.Cookies { - resp.Header.Add("Set-Cookie", cookie) - } - - return resp, nil -}