Metadata-Version: 2.4
Name: abstract_webtools
Version: 0.1.6.426
Summary: Composable, manager-based utilities for fetching, parsing, crawling, and mirroring web content — with managed sessions, TLS/user-agent control, and a single shared request pipeline.
Home-page: https://github.com/AbstractEndeavors/abstract_webtools
Author: putkoff
Author-email: partners@abstractendeavors.com
Classifier: Development Status :: 3 - Alpha
Classifier: Intended Audience :: Developers
Classifier: License :: OSI Approved :: MIT License
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.11
Requires-Python: >=3.8
Description-Content-Type: text/markdown
Requires-Dist: requests>=2.31.0
Requires-Dist: urllib3>=2.0.4
Requires-Dist: beautifulsoup4>=4.12.0
Requires-Dist: opencv-python
Requires-Dist: moviepy==1.0.3
Requires-Dist: SpeechRecognition
Requires-Dist: pydub
Requires-Dist: selenium
Requires-Dist: playwright
Requires-Dist: m3u8_To_MP4
Requires-Dist: m3u8
Requires-Dist: wordsegment
Requires-Dist: pika
Provides-Extra: gui
Requires-Dist: PySimpleGUI>=4.60.5; extra == "gui"
Requires-Dist: PyQt5>=5.15.0; extra == "gui"
Provides-Extra: drivers
Requires-Dist: selenium>=4.15.2; extra == "drivers"
Requires-Dist: webdriver-manager>=4.0.0; extra == "drivers"
Provides-Extra: media
Requires-Dist: yt-dlp>=2024.4.9; extra == "media"
Requires-Dist: m3u8>=4.0.0; extra == "media"
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: provides-extra
Dynamic: requires-dist
Dynamic: requires-python
Dynamic: summary

# Abstract WebTools

**Composable, manager-based utilities for fetching, parsing, crawling, and mirroring web content.**

Abstract WebTools wraps the messy parts of web access — HTTP sessions, TLS/cipher
configuration, user‑agent rotation, retries, HTML parsing, link extraction,
crawling, headless browsers, and media downloading — behind a set of small,
composable **managers**. The managers share a single `URL → request → soup`
pipeline, so a page is fetched **once** and reused everywhere downstream instead
of being re‑fetched by every layer.

- **Author:** putkoff ([Abstract Endeavors](https://github.com/AbstractEndeavors))
- **Source:** https://github.com/AbstractEndeavors/abstract_webtools
- **Python:** 3.8+
- **License:** MIT

---

## Table of contents

- [Why](#why)
- [Install](#install)
- [Quick start](#quick-start)
- [Architecture: the manager chain](#architecture-the-manager-chain)
- [The managers](#the-managers)
- [Common recipes](#common-recipes)
  - [Get a page's source / soup](#get-a-pages-source--soup)
  - [Extract links](#extract-links)
  - [Crawl a site](#crawl-a-site)
  - [One shared context: `UnifiedWebManager`](#one-shared-context-unifiedwebmanager)
  - [A managed `requests.Session`](#a-managed-requestssession)
  - [Mirror an entire site (`usurpManager`)](#mirror-an-entire-site-usurpmanager)
  - [Download video / media](#download-video--media)
- [Design notes](#design-notes)
- [Testing](#testing)
- [Contributing](#contributing)

---

## Why

Most scraping code re‑implements the same plumbing on every project: building a
session, picking a user agent, tuning TLS so a server doesn't reject you,
handling retries, parsing HTML, then doing it all again for the next step.

Abstract WebTools factors each concern into a manager and **threads shared
instances through the chain**. Pass an existing `req_mgr` (or source code) into
any higher‑level manager and it is reused as‑is — no rebuild, no second network
request.

---

## Install

```bash
pip install abstract_webtools
```

Optional extras:

```bash
pip install "abstract_webtools[drivers]"   # selenium + webdriver-manager
pip install "abstract_webtools[media]"     # yt-dlp + m3u8 for video downloads
pip install "abstract_webtools[gui]"       # PyQt/PySimpleGUI helpers
```

The core fetch/parse chain uses `requests`, `urllib3`, and `beautifulsoup4`.
Browser automation uses `selenium` / `playwright`, and media downloads use
`yt-dlp` (installed by the `media` extra).

---

## Quick start

```python
from abstract_webtools import get_soup, get_source, linkManager

# Fetch + parse a page (one request, reused internally)
soup = get_soup("https://example.com")
print(soup.title.text)

# Just the raw HTML
html = get_source("https://example.com")

# All links + image links on the page
lm = linkManager("https://example.com")
print(lm.all_desired_links)
print(lm.all_desired_image_links)
```

---

## Architecture: the manager chain

The core managers form a layered pipeline. Each layer accepts the layer(s)
below it and **reuses them when provided**:

```
urlManager        normalize / validate / vary URLs
   └─ requestManager   sessions, retries, TLS, UA  ── networkManager ┐
        └─ soupManager       BeautifulSoup parsing                   ├─ userAgentManager
             ├─ linkManager       link / image extraction            ├─ cipherManager
             └─ crawlManager       site crawling / sitemaps          └─ sslManager + tlsAdapter
```

Every layer has a matching factory function that detects and reuses an existing
instance:

| Factory | Returns | Reuses when given |
|---|---|---|
| `get_url_mgr(url=, url_mgr=)` | `urlManager` | `url_mgr` |
| `get_req_mgr(url=, url_mgr=, source_code=, req_mgr=)` | `requestManager` | `req_mgr` |
| `get_source(...)` | HTML string | `source_code` / `req_mgr` |
| `get_soup_mgr(...)` | `soupManager` | `soup_mgr` / `req_mgr` |
| `get_soup(...)` | `BeautifulSoup` | `soup` / `soup_mgr` / `source_code` |
| `get_crawl_mgr(...)` | `crawlManager` | `req_mgr` / `url_mgr` |
| `get_managed_session(...)` | `requests.Session` | `req_mgr` |

Because every factory short‑circuits on an instance you pass in, the whole chain
is built once and shared:

```python
from abstract_webtools import get_req_mgr, get_soup_mgr, linkManager

req = get_req_mgr("https://example.com")        # fetches once
soup_mgr = get_soup_mgr(req_mgr=req)            # no re-fetch
links = linkManager(req_mgr=req)                # no re-fetch
```
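
The short-circuit idea can be pictured with a minimal, standalone sketch. The names `FakeRequestManager` and `get_fake_req_mgr` are hypothetical stand-ins invented for illustration; they are not part of the package, and the real factories may do more work than this.

```python
from typing import Optional


class FakeRequestManager:
    """Hypothetical stand-in for a request manager; counts simulated fetches."""

    fetch_count = 0

    def __init__(self, url: str) -> None:
        self.url = url
        FakeRequestManager.fetch_count += 1  # pretend this is the network call
        self.source_code = f"<html><title>{url}</title></html>"


def get_fake_req_mgr(url: Optional[str] = None,
                     req_mgr: Optional[FakeRequestManager] = None) -> FakeRequestManager:
    """Return req_mgr unchanged when given; build a new one only otherwise."""
    if req_mgr is not None:
        return req_mgr
    if url is None:
        raise ValueError("need either url or req_mgr")
    return FakeRequestManager(url)


first = get_fake_req_mgr("https://example.com")
second = get_fake_req_mgr(req_mgr=first)  # short-circuits: the same object comes back
```

Because the second call returns the object it was handed, the simulated fetch counter stays at one no matter how many layers ask for a request manager.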

---

## The managers

| Manager | Responsibility |
|---|---|
| **urlManager** | Parse, validate, normalize and generate URL variants. |
| **requestManager** | `requests.Session` with retries, timeouts, TLS adapter, UA, proxies, cookies; optional Selenium fallback. |
| **networkManager** | Mounts the TLS adapter and wires proxies/cookies/UA into the session. |
| **userAgentManager** | Realistic user agents and per‑URL headers (random or pinned by OS/browser). |
| **cipherManager** | Cipher‑suite strings for TLS. |
| **sslManager** / **tlsAdapter** | SSL context + `HTTPAdapter` for fine‑grained TLS control. |
| **soupManager** | BeautifulSoup parsing, meta/link extraction, attribute discovery. |
| **linkManager** | Internal/image link extraction with desired/undesired filters. |
| **crawlManager** | Recursive crawling, sitemap generation, domain link discovery. |
| **middleManager** | `UnifiedWebManager` — one lazy facade over the whole chain. |
| **usurpManager** | Full‑site mirror: pages + assets + styles, references rewritten for offline use. |
| **videoDownloader** | Video/media download via `yt-dlp` / `m3u8`, wired to the managed session/UA. |
| **seleneumManager** / **playwriteManager** | Headless‑browser source fetching for JS‑rendered pages. |
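
To make the TLS rows of the table more concrete, here is a rough standard-library sketch of the kind of work a cipher/SSL layer performs. The function name `build_tls_context` and the cipher string are assumptions chosen for illustration, not values taken from `cipherManager` or `sslManager`.

```python
import ssl


def build_tls_context(ciphers: str = "ECDHE+AESGCM:ECDHE+CHACHA20") -> ssl.SSLContext:
    """Build a client-side SSL context with a restricted cipher list."""
    ctx = ssl.create_default_context()            # certificate verification stays on
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2  # refuse older protocol versions
    # Restricts the TLS 1.2 suites; TLS 1.3 suites are managed separately by OpenSSL.
    ctx.set_ciphers(ciphers)
    return ctx


ctx = build_tls_context()
cipher_names = [c["name"] for c in ctx.get_ciphers()]
```

With `requests`, a context like this is commonly attached by subclassing `HTTPAdapter` and passing `ssl_context` through `init_poolmanager`, then mounting the adapter on the session for `https://`.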

---

## Common recipes

### Get a page's source / soup

```python
from abstract_webtools import get_source, get_soup, get_soup_mgr

html = get_source("https://example.com")
soup = get_soup("https://example.com")

# Reuse already-fetched HTML — no network call
soup2 = get_soup(source_code=html)

# Soup manager exposes parsing helpers
sm = get_soup_mgr("https://example.com")
print(sm.get_all_attribute_values(tags_list=["a", "img"]))
```

### Extract links

```python
from abstract_webtools import linkManager

lm = linkManager(
    "https://example.com",
    link_attr_value_desired=["/blog/"],      # keep only links containing this
    image_link_tags="img",
)
print(lm.all_desired_links)
print(lm.find_all_domain())                  # unique domains found
```
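
As an illustration of what substring-based desired/undesired filtering amounts to, here is a standard-library sketch. `HrefCollector` and `filter_links` are hypothetical helpers written for this example; `linkManager` itself may match and normalize links differently.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collect absolute href values from <a> tags."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.links: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base_url, href))


def filter_links(links, desired=(), undesired=()):
    """Keep links containing any desired substring and no undesired one."""
    kept = []
    for link in links:
        if desired and not any(s in link for s in desired):
            continue
        if any(s in link for s in undesired):
            continue
        kept.append(link)
    return kept


html = '<a href="/blog/post-1">a</a><a href="/about">b</a><a href="/blog/tmp/draft">c</a>'
collector = HrefCollector("https://example.com")
collector.feed(html)
blog_links = filter_links(collector.links, desired=["/blog/"], undesired=["/tmp/"])
```

Here only `https://example.com/blog/post-1` survives: `/about` lacks the desired substring and the draft link contains an undesired one.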

### Crawl a site

```python
from abstract_webtools import get_crawl_mgr, get_domain_crawl

crawl = get_crawl_mgr("https://example.com")
domain_links = get_domain_crawl("https://example.com", max_depth=3)
```
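
One way to picture what a `max_depth` cap does is a depth-limited breadth-first walk over a link graph. The sketch below runs on a hypothetical in-memory site and is only an illustration; the traversal order and bookkeeping inside `crawlManager` may differ.

```python
from collections import deque
from typing import Dict, List, Optional


def crawl_bfs(start: str, graph: Dict[str, List[str]],
              max_depth: Optional[int] = None) -> List[str]:
    """Breadth-first walk over a link graph; max_depth=None means no cap."""
    visited = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand links past the depth cap
        for link in graph.get(page, []):
            if link not in visited:   # the visited set breaks cycles
                visited.add(link)
                order.append(link)
                queue.append((link, depth + 1))
    return order


# Hypothetical in-memory site: each page maps to the pages it links to.
site = {
    "/": ["/a", "/b"],
    "/a": ["/", "/c"],      # links back to "/" (a cycle)
    "/b": ["/c"],
    "/c": ["/d"],
    "/d": [],
}
shallow = crawl_bfs("/", site, max_depth=1)
full = crawl_bfs("/", site)
```

With `max_depth=1` only the start page and its direct links are collected; without a cap the walk ends once every reachable page has been seen.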

### One shared context: `UnifiedWebManager`

`UnifiedWebManager` lazily builds and caches `url_mgr`, `req_mgr`, `source_code`,
`soup_mgr`, `soup`, plus `link_mgr` / `crawl_mgr` — all over a single fetch.

```python
from abstract_webtools import UnifiedWebManager

web = UnifiedWebManager("https://example.com")
web.url_mgr      # built on demand
web.source_code  # fetched once
web.soup         # parsed once
web.link_mgr     # shares the same chain — no re-fetch
web.crawl_mgr

# Or start from HTML you already have (zero network):
web = UnifiedWebManager(source_code="<html>...</html>")
web.soup.title
```
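
The lazy build-and-cache behavior can be sketched with `functools.cached_property`. `LazyPage` is a hypothetical class written for this example, and its `_fetch` method only simulates a request; how `UnifiedWebManager` caches internally is not shown here.

```python
from functools import cached_property
from typing import Optional


class LazyPage:
    """Hypothetical facade: each stage is computed on first access, then cached."""

    def __init__(self, url: Optional[str] = None,
                 source_code: Optional[str] = None) -> None:
        self.url = url
        self._given_source = source_code
        self.fetches = 0

    def _fetch(self) -> str:
        # Stand-in for a real HTTP request.
        self.fetches += 1
        return f"<html><title>{self.url}</title></html>"

    @cached_property
    def source_code(self) -> str:
        if self._given_source is not None:
            return self._given_source   # supplied HTML: no fetch at all
        return self._fetch()

    @cached_property
    def title(self) -> str:
        # Stand-in for parsing; reads the cached source.
        return self.source_code.split("<title>")[1].split("</title>")[0]


page = LazyPage("https://example.com")
titles = [page.title, page.title]            # second access hits the cache
offline = LazyPage(source_code="<html><title>local</title></html>")
offline_title = offline.title
```

Nothing is fetched at construction time, repeated attribute access triggers one simulated fetch, and an instance built from existing HTML never fetches.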

### A managed `requests.Session`

Need a plain session that is still configured with a real user agent, ciphers,
the TLS adapter and proxies? Ask the stack for one. Building the session makes
no network request, and when you pass an existing `req_mgr` its session is reused:

```python
from abstract_webtools import get_managed_session

session = get_managed_session(user_agent="MyBot/1.0")
resp = session.get("https://example.com")
```

### Mirror an entire site (`usurpManager`)

`usurpManager` saves a working offline copy of a site with **pages and styles
intact**. **By default it recursively captures the whole site**: every
same‑domain page link and all referenced media. It follows CSS `url(...)` /
`@import` (including `@font-face` and cross‑domain CDN fonts), handles `srcset`,
inline `style=""` and `<style>` blocks, downloads scripts/images/linked files,
and rewrites every reference to a relative local path so the result renders
straight from `file://`.

```python
from abstract_webtools import usurpit

# Full recursive capture of the entire site (unlimited depth by default):
result = usurpit("https://example.com", output_dir="example_mirror")
print(result["output_dir"], len(result["pages"]), "pages")
```

Or drive it directly for more control:

```python
from abstract_webtools import usurpManager, get_req_mgr

req = get_req_mgr("https://example.com")
site = usurpManager(
    "https://example.com",
    req_mgr=req,                      # reuse the managed session
    output_dir="example_mirror",
    max_depth=None,                   # default: unlimited (whole site); set an int to cap
    mirror_external_assets=True,      # pull CDN css/fonts so styles work (default)
)
summary = site.main()
```

- The crawl is breadth‑first and **unlimited‑depth by default** (`max_depth=None`);
  a visited set prevents revisiting pages, so the crawl is loop‑free and ends once
  every reachable page has been seen. Pass an integer `max_depth` to bound it.
- Pages are mirrored within the origin host; referenced assets may come from
  CDNs (set `mirror_external_assets=False` to stay strictly on‑origin).
- A single `url → local path` map keeps references consistent and shared assets
  are fetched exactly once.
- For heavily JS‑rendered sites, fetch the rendered HTML first via
  `seleneumManager` / `playwriteManager`.
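
The single `url → local path` map and the reference rewriting can be sketched as follows. This is a simplified illustration with hypothetical helpers (`local_path_for`, `rewrite_css`) and an assumed `host/path` layout; the real mirror's directory scheme is not documented here, and a fuller version would also skip `data:` URIs and handle `@import`.

```python
import posixpath
import re
from urllib.parse import urljoin, urlparse

# Matches url(...) with optional quotes around the reference.
CSS_URL = re.compile(r"""url\(\s*['"]?([^'")]+)['"]?\s*\)""")


def local_path_for(url: str) -> str:
    """Map an absolute URL to a path inside the mirror (assumed layout: host/path)."""
    parts = urlparse(url)
    path = parts.path.lstrip("/") or "index.html"
    return posixpath.join(parts.netloc, path)


def rewrite_css(css: str, css_url: str, url_map: dict) -> str:
    """Replace each url(...) with a path relative to the CSS file's local copy."""
    css_dir = posixpath.dirname(local_path_for(css_url))

    def swap(match):
        absolute = urljoin(css_url, match.group(1))
        # setdefault records each asset once, so repeated references share one path
        target = url_map.setdefault(absolute, local_path_for(absolute))
        return f"url({posixpath.relpath(target, css_dir)})"

    return CSS_URL.sub(swap, css)


url_map = {}
css = 'body{background:url("../img/bg.png")} h1{background:url(/img/bg.png)}'
rewritten = rewrite_css(css, "https://example.com/static/site.css", url_map)
```

Both references resolve to the same absolute URL, so the map ends up with one entry and both are rewritten to the same relative path.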

### Download video / media

```python
from abstract_webtools import get_video_info, downloadvideo

info = get_video_info("https://www.youtube.com/watch?v=...")   # metadata only
downloadvideo("https://www.youtube.com/watch?v=...", download_directory="videos")
```

The downloader pulls its user agent (and proxy) from the shared request stack
and threads them into `yt-dlp`, so downloads use the same identity as the rest
of your scrape. You can inject an existing `req_mgr` / `ua_mgr`:

```python
from abstract_webtools import VideoDownloader, get_req_mgr

req = get_req_mgr("https://example.com")
VideoDownloader(url="https://example.com/video.mp4", req_mgr=req,
                download_directory="videos")
```
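
For a sense of what threading an identity into `yt-dlp` can look like, the sketch below assembles an options dictionary using `yt-dlp`'s `http_headers`, `proxy`, `paths`, and `outtmpl` options. The function `build_ytdlp_options` is hypothetical, and whether `VideoDownloader` builds its options this way is an assumption.

```python
from typing import Any, Dict, Optional


def build_ytdlp_options(user_agent: str,
                        download_directory: str,
                        proxy: Optional[str] = None) -> Dict[str, Any]:
    """Assemble a yt-dlp options dict that carries a shared identity."""
    options: Dict[str, Any] = {
        "http_headers": {"User-Agent": user_agent},   # sent with yt-dlp's requests
        "paths": {"home": download_directory},        # base output directory
        "outtmpl": "%(title)s.%(ext)s",               # file name template
    }
    if proxy:
        options["proxy"] = proxy                      # only set when a proxy is given
    return options


opts = build_ytdlp_options("MyBot/1.0", "videos", proxy="http://127.0.0.1:8080")
no_proxy_opts = build_ytdlp_options("MyBot/1.0", "videos")
# With yt-dlp installed, the dict would be passed as: yt_dlp.YoutubeDL(opts)
```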

---

## Design notes

- **Reuse over rebuild.** Every factory and constructor honors an instance you
  pass in. Supplying `source_code` or a `req_mgr` means **zero** extra network
  requests downstream.
- **One session, fully configured.** TLS ciphers, SSL context, the HTTP adapter,
  user agent, proxies and cookies are assembled once by the request stack and
  reused — including by `usurpManager` and `videoDownloader`.
- **Optional heavy deps stay optional.** Browser/media/GUI extras are imported
  defensively so the core package imports without them.

---

## Testing

The repo ships dependency‑light regression tests (only `requests` +
`beautifulsoup4` required) that load the real modules under a controlled
namespace and verify both the no‑refetch behavior and the site mirror:

```bash
python tests/test_manager_chain.py        # url/request/soup/link chain reuse
python tests/test_video_usurp_chain.py    # managed session for video + usurp
python tests/test_usurp_mirror.py         # full-site mirror with styles
```
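
Loading a module under a controlled namespace can be done with `importlib`, which avoids importing a package's `__init__` and its heavier dependencies. The sketch below is one possible approach, shown on a hypothetical temporary module; the repository's tests may set this up differently.

```python
import importlib.util
import sys
import tempfile
import types
from pathlib import Path


def load_module_as(name: str, path: Path) -> types.ModuleType:
    """Import one source file under a chosen module name, bypassing package __init__."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module      # register before executing the module body
    spec.loader.exec_module(module)
    return module


# Hypothetical module file standing in for a real manager module.
FAKE_SOURCE = (
    "CALLS = 0\n"
    "\n"
    "def fetch():\n"
    "    global CALLS\n"
    "    CALLS += 1\n"
    "    return '<html></html>'\n"
)

with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "fake_manager.py"
    source.write_text(FAKE_SOURCE)
    mod = load_module_as("sandbox_fake_manager", source)
    html = mod.fetch()
    call_count = mod.CALLS   # a test can assert on counters like this one
```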

---

## Contributing

Issues and PRs welcome at
[AbstractEndeavors/abstract_webtools](https://github.com/AbstractEndeavors/abstract_webtools).
Please keep new functionality threaded through the shared manager chain (accept
and reuse `url_mgr` / `req_mgr` / `source_code`) rather than re‑fetching, and add
a dependency‑light test where practical.
