PLEASE NOTE: While the first half of my OSINT release was more geared toward hacking, Red Teaming and penetration testing, I have also endeavored to make this release more valuable to others who utilize OSINT.
If while reading this you decide it is not applicable to your particular needs where OSINT is concerned, then please jump to the last section.
At the very least, I can provide you with new resources that allow you to find almost anything that exists on the Internet.
I have had many opportunities to use OSINT in an investigative manner closer to the use cases present in other professions/pursuits; this has mostly occurred on occasions where I have worked the defensive side of things.
While the content herein may be flavored with some offensive security terms, I have endeavored to make sure the content is (at worst) broadly applicable with a little work.
For instance, this first section uses words/images to detail how I once used a web developer's CMS as a means of OSINT.
*This CMS (and all I detail concerning it) could have been found simply (and in a way that I don’t believe to be implicitly illegal) with a web browser via Search Engine dorking against any of the employee email addresses that were present (a few example dorks of that kind appear after the list below).
While an investigator may not have accessed the emails present for legal/ethical reasons, if you were investigating one of these employees, could some of these things that were in plain sight on different pages be helpful to your investigation:
1. Access to months’ worth of your target’s past/present/future work schedule; for example, their work days/hours on or off and the physical work location they would be at.
2. Employee information: position at the company and contact information such as multiple work phone numbers, multiple personal phone numbers, personal email addresses and other business email addresses.
3. Months’ worth of their past/present/future activities/scheduling via their business calendar.
4. Similar information belonging to their co-workers.
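To make the dorking idea concrete, these are the kinds of queries I mean; the email address and domains below are placeholders, and exact operator support varies a bit between search engines:

```
"jane.doe@targetcompany.com"
"jane.doe@targetcompany.com" -site:targetcompany.com
"@targetcompany.com" site:cms-provider.example
"jane.doe@targetcompany.com" filetype:pdf
```

Running variations of these against each known employee address is often all it takes to surface the out-of-the-way pages where an address has been indexed.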
And finally, the contents of this release use one idea to introduce many others; while one of the main concepts we will cover surrounds what I call “archives”, this writing will review many OSINT-related sources, techniques, thoughts and tools that can be a means to an end by themselves.
One concept may be the trunk of this tree, but it is not its roots, leaves, fruit or branches.
The Kitchen Sink
During my career as a professional hacker (which encompasses penetration testing, Red Teaming and Bug Hunting), most of the targets I have found myself engaging have consisted of companies and organizations.
Through targeting these entities, I often end up targeting individuals as a means to attack a company or organization.
In my experience, a huge portion of the OSINT that can be found on the Internet originates from companies/organizations trying to generate, maintain and facilitate commerce; the processes and materials involved in doing so manifest themselves as forms/methods of communicating that produce OSINT.
The best type of information you can gain to exploit (or investigate) a target is almost always going to be the information a target provides about themselves (especially an unsuspecting target).
However, often much of the most visible information a target puts on the Internet is heavily subjective, especially if that information is closely related to things like commerce (money).
Or, at the very least, the hint/involvement of things like commerce should be a concern when analyzing/reviewing OSINT.
Having to deal with material that may be heavily subjective can lessen the value OSINT can provide during an engagement/operation (or investigation): the presence of subjectivity further burdens processes of analysis/review.
As we covered in Part One, OSINT already tends to require greater logistical resources than other forms of reconnaissance/enumeration utilized during an engagement/operation, especially since most engagements/operations won’t have the resources/logistics available that most processes for utilizing/processing OSINT assume.
There is a way to circumvent many of the issues that a large percentage of OSINT may present while maximizing the advantages/resources it can provide: by targeting what I call “archives”.
To put it another way: we want to skip talking to the illusion the Wizard of Oz used to represent himself and jump right to pulling back the curtain, thereby more quickly/efficiently revealing the truth of the man hidden behind it.
Archives are large caches of data that can be publicly accessed on the Internet and tend to be much more objective in nature but no less (and often more) information rich; they tend to hold a wealth of material that targets are willing to make publicly accessible, though they do not want it overly visible.
Or perhaps the target has forgotten about the archive or access to the archive…
Archives are often maintained by the target (or sources that can be tied to the target) for internal use by employees or customers/clients, thereby circumventing much of the hyperbole and subjective nuances that can reduce the value of OSINT by adding time/energy to the processes necessary to convert it to relevant, actionable/applicable intelligence.
Better yet, searching for archives often uncovers other rich sources of OSINT along the way.
What I call archives can exist in all number of forms; what they have in common is that they are publicly accessible and contain multiple pages/directories of content.
Often, target archives sit on some out-of-the-way subdomain in order to ensure ease of access for internal/in-house employees and to ensure that customers/clients and/or outsourced help can access them.
Sometimes archives are found in the form of open web directories or web servers of every type (such as publicly accessible FTP directories/servers).
Or web pages/directories with content that was not meant to be publicly accessible, but is, because it was rendered in another form (.XMLs are outstanding in this regard).
An archive could be made up of many documents on a target’s webpages that they thought to hide via complex directory/file names.
Sometimes these archives are comprised of logs created by messaging solutions (IRC, Slack, etc.) used by developer/engineer teams and/or something like commit histories (usually with a fair amount of conversation/debate) used by international teams of developers/engineers.
And as was the case in the archive shown in Part 1, these caches of material have often broken engagements or some phase of an engagement wide open for me…
The employees responsible for the archives and/or storing the material within them often did not (or do not) have Information Security training or awareness; thus, it is not unusual to find archives that contain material perfect for building a social engineering campaign (internal email/document templates, contact information not otherwise available publicly, employee/customer/client personal identification data, etc.), PAC files, certificates, material containing valid credentials and/or providing insider knowledge of the configurations of network hosts/appliances/defenses…
An example of an unconventional archive
During an external penetration testing engagement, a web development company doing work on the target’s IP space made some mistakes that allowed me to access an OSINT rich archive.
The archive was found via manually searching the target’s IP space with a technique that will be discussed a bit later in this release.
Configurations that were put in place by the web development company (for the convenience of the web development company) rendered everything seen in the images within this section publicly accessible with Administrator rights, including all of the resources that comprised the web dev provider’s Content Management System (CMS).
This archive provided considerable data that could have been used against either of the companies or any of the employees involved in this project, as I had full view of the day-to-day operations involved in the project the CMS was established to undertake (employee calendars which included scheduling, emails, ticketing system, tasks underway/completed, etc.).
Quite a bit of OSINT relevant to the web development company, its employees and our mutual client could be downloaded via the Media Center option or otherwise accessed via the “Manage Files” option; the contents of either could also have been mined for valuable metadata.
Further testing showed that this archive could have been found through other means, like Search Engine Dorking vs. email addresses belonging to web development company employees that were included in certain sections of the CMS.
This yielded quite a bit of OSINT that could have been used against our mutual client and the web development company itself…
I could read some of the web dev provider’s internal emails to/from our mutual client and between the web development company’s employees who were working on this project.
Also, I could access employee information belonging to all 11 of the web development company’s employees (which included some private/personal information for 7 of the employees via the “View Profiles” tab) working on this project, including the Sales staff involved.
The CMS also allowed me to access completed email and invoice templates the web development company had been contracted to create for our mutual client.
Along with the other data the CMS provided, these templates would have been the perfect finishing touch to add extra potency to social engineering attacks toward exploitation or fraud vs. either of the companies concerned.
As I mentioned in Part 1 of this release, by accident or on purpose, the majority of companies I have engaged had some manner of OSINT rich archives sitting somewhere on the Internet, publicly available.
Most of the attacks I conduct have a limited duration, which means I cannot spend forever looking for just these archives.
Fortunately, I have developed some processes to locate these and other sources of valuable intelligence as I do other things throughout all phases of an engagement, even outside of prescribed/formal reconnaissance/enumeration phases.
As early as possible during an engagement, I run multiple automated tools against the target’s root domain, as these require no extra attention on my part and allow me to undertake more manual methods/means of reconnaissance/enumeration.
Image above: Pagodo automated dorking, Datasploit and Spiderfoot all running on the same machine during an engagement. As the resources necessary to run these separately or individually are not overly taxing on a machine (I run Spiderfoot continually throughout an engagement, often in combination with other automated tools such as Pagodo and Pymeta, using a Dell Dimension desktop from 2004 with 512 MB of RAM), multiple instances can be deployed fairly quickly if need be, yielding quite a bit of invaluable information concerning a target vs. very little logistical cost in comparison.
These tools include Spiderfoot, OWASP Amass, Pymeta and Pagodo run vs. a root domain, which help establish what I call “Passive Advantage”: these tools are enumerating the target without any extra energy/attention investment on my part past starting the tools (and, depending on the engagement/operation, starting/establishing means to obfuscate these tools).
This allows me to work manual methods of reconnaissance/enumeration while exponentially furthering the pool of available information I can use toward exploiting the target throughout the engagement/operation.
I keep running tools that help me to establish Passive Advantage throughout the engagement; as we move past the reconnaissance/enumeration proper phases, I use these tools to refine/isolate targets/possible targets of interest while I work my manual techniques.
For example, the reconnaissance/enumeration phases proper may see Passive Advantage tools run against a root domain, then run against a subdomain that is found as I manually search other domains/subdomains, then these tools may be run against an email address or document/hash/filename found on the subdomain as I search it manually, etc.
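To make the Passive Advantage idea concrete, here is a minimal sketch of how several of these tools can be kicked off in parallel against a root domain and then left alone while manual work continues. The domain, output paths and command-line flags below are illustrative only; flags differ between tool versions, so confirm them against each tool's own help output before relying on them.

```python
import subprocess
from pathlib import Path

TARGET = "targetcompany.com"          # placeholder root domain
OUTDIR = Path("recon") / TARGET
OUTDIR.mkdir(parents=True, exist_ok=True)

# Illustrative commands only -- exact flags vary by tool version,
# so check each tool's --help before use.
commands = {
    "amass":      ["amass", "enum", "-d", TARGET, "-o", str(OUTDIR / "amass.txt")],
    "spiderfoot": ["python3", "sf.py", "-s", TARGET, "-o", "csv"],
    "pymeta":     ["pymeta", "-d", TARGET],
    "pagodo":     ["python3", "pagodo.py", "-d", TARGET],
}

procs = {}
for name, cmd in commands.items():
    log = open(OUTDIR / f"{name}.log", "w")
    # Fire and forget: each tool keeps enumerating while we work manually.
    procs[name] = subprocess.Popen(cmd, stdout=log, stderr=subprocess.STDOUT)

print("Passive Advantage tools running:", ", ".join(procs))
# Later in the engagement, collect whatever has finished:
# for name, p in procs.items():
#     p.wait()
```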
We will assume/imagine these tools for establishing Passive Advantage are running while we start out by covering manual enumeration of archives first.
Manual Enumeration of Archives
Hunter.io: an OSINT source & a source for finding OSINT
While I have all manner of automated reconnaissance/enumeration tools running, one of the first resources I often use to begin manual enumeration of a target is Hunter.io, which keeps a huge archive of over 1.5 million email addresses, many of them internal/corporate email addresses; even a free account can give you access to hundreds of a target’s emails.
Corporate/internal email addresses have a huge amount of value; for instance, just at base value, corporate/internal email addresses often double as the user/username component of a set of credentials…
Above: Hunter.io lets you search for companies with an autofill/autosearch-like capacity and will enumerate their email address naming conventions (if the target’s name is incorporated into this option).
Below: The target didn’t appear via the automatic search option, so I ran the target’s root domain name in Hunter.io’s search bar, which enumerated 434 of the target’s internal/corporate email addresses with a free account.
Below: The real value of Hunter.io is not in the collection of e-mail addresses; tools like theHarvester do that as well, though it is a bit more labor intensive (and you may find value in running tools like that as well as using Hunter.io).
I find the real value of Hunter.io rests in its quantification and indexing of these addresses.
For instance, Hunter.io has broken down the most common naming convention used by the target for all email addresses.
The email addresses listed for a company/organization are usually divided under different generic departmental classifications; an individual employee’s placement under each classification is usually based on the position/job title Hunter.io has enumerated for them (which, in my experience, tend to be accurate).
These classifications can seriously help with target designation and further targeted recon/enumeration; for instance, the image above has fields like “Communication”, “Marketing”, “Human Resources”, etc.
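For those who would rather pull the same data programmatically, Hunter.io also exposes an API for this domain search. Below is a minimal sketch, assuming a placeholder API key and domain; field names such as “pattern”, “department”, “position” and “sources” should be double-checked against Hunter.io’s current API documentation.

```python
import json
import urllib.parse
import urllib.request
from collections import defaultdict

API_KEY = "YOUR_HUNTER_API_KEY"        # placeholder
DOMAIN = "targetcompany.com"           # placeholder root domain

url = ("https://api.hunter.io/v2/domain-search?"
       + urllib.parse.urlencode({"domain": DOMAIN, "api_key": API_KEY}))

with urllib.request.urlopen(url) as resp:
    data = json.load(resp)["data"]

# "pattern" is the naming convention Hunter has inferred, e.g. "{first}.{last}"
print("Likely email naming convention:", data.get("pattern"))

# Group the returned addresses by Hunter's departmental classification
by_department = defaultdict(list)
for email in data.get("emails", []):
    by_department[email.get("department") or "unclassified"].append(email)

for dept, emails in by_department.items():
    print(f"\n[{dept}] {len(emails)} addresses")
    for e in emails:
        print(" ", e.get("value"), "-", e.get("position"), "-",
              len(e.get("sources", [])), "source(s)")
```

Grouping by department like this reproduces the target-designation benefit described above without having to click through the web interface for every address.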
Below: Each email has the target employee’s full name, job title and a designation of whether Hunter.io has verified the e-mail or not (green check and shield).
Below: Most importantly, there is a “Sources” tab showing where the e-mail address was located (and possibly verified).
This is exceptionally useful as it often provides many instances of OSINT and other domains/subdomains that could be sources for OSINT, some of which include blogs this employee has written (and likely the location of other blogs written by the target company’s other employees).
For example, this particular employee had blog entries and articles concerning them on sites belonging to the target company and a university, both of which were dated in the same years.
The dates are helpful for quantifying many things: for instance, they can help estimate how long the employee has been working for the company or how long they have been involved with other organizations relating to their profession; these dates can also help establish the target’s importance/stature within the companies and/or organizations they are involved with.
The “Sources” tab sometimes links to personal pages belonging to employees as well.
Below: The instances marked “removed” can lead to especially valuable instances/resources of OSINT, as they sometimes denote sensitive or problematic resources that the company or employee have attempted to remove; these could possibly be found by resources like the Wayback Machine.
They also sometimes lead to subdomains the individual’s employer has abandoned but that are still Internet accessible; this sometimes happens because the individual was employed by an entity acquired by the target company.
The company also acquires the domain space of that entity and incorporates it as a subdomain under their own IP space, but does not maintain it…this can present a resource containing sensitive data and/or vulnerable infrastructure.
I will often start looking for archives using Hunter.io’s classification scheme to locate the IT/Developer/Engineer (technology) based employees and look through the instances listed under the “Sources” tab.
My techniques for attempting to discover archives generally start with attempts to leverage aspects/facets of the target’s technical/technological resources.
This usually starts with my leveraging aspects of the target’s technological/technical employees: applying search techniques or web based resources that leverage email addresses, names, unique/unusual job positions or titles belonging/related to positions encompassing Developers, Engineers and IT.
If that fails, my search techniques and/or resources will leverage technology/technical related words/terms that seem unique to the target: examples include hardcoded/default username/password/credential sets I haven’t seen before, the name of some technical component, a serial number, a hardcoded IP and/or MAC address (a few example dorks of this kind appear after the list below).
There are many reasons for this, but the main two are:
- Most of the company/organization archives I have found contain (or were established to store) some quantity of technical/technological resources…finding these archives often means utilizing Search Engines for things like Dorking or the Search functionality of various web based resources.
Utilizing searches that leverage technological/technical based terms/words/employees raises the probability of finding archives or other worthwhile resources (the second reason is very much entwined with this probability).
- Searching against a target’s technological/technical resources (that aren’t things like a popular product or an employee with a high public profile) often finds fewer, but much more relevant, indexed Search Engine results.
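For illustration, these are the kinds of technology-flavored dorks I mean; every domain, credential string, serial number and IP below is a placeholder, and operator behavior differs slightly between search engines:

```
site:targetcompany.com intitle:"index of"
site:targetcompany.com filetype:xml OR filetype:log OR filetype:cfg
"admin:changeme_1234" -site:targetcompany.com
"SN-00000-PLACEHOLDER" OR "10.0.0.1" site:pastebin.com
"dev.targetcompany.com" -site:targetcompany.com
```

A handful of queries like these, rotated through the target’s unique technical terms, tends to surface archives and open directories far faster than broad searches against the company name.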
Websites as Archives
Sometimes, targets have material on their websites that they try to place in out-of-the-way directories, often using complex naming conventions.
Or, the content may not have been placed with any special degree of cunning, but the web presence of that company/organization is spread across multiple sites/domains/subdomains.
Also, if a company/organization is your target or an individual you are targeting has a special attachment/relationship with a company/organization, then it is worth searching the website(s) of that entity, but you may want to gain an overview of the content of each site to gauge whether it is worth your time to search, or how much time is worth investing in searching.
These techniques accomplish all of those goals and can yield a solid quantity/quality of OSINT, even when it is not concentrated in one solitary cache of material.
As shown in the example in the first section detailing the web developer’s CMS, one of the tools I utilize for these tasks is Firefox Browser equipped with the Link Gopher extension.
Link Gopher will show you all of the links present on a webpage, most of the links available on a website and all of the sites connected by links to the website/webpage.
When I say “links”, I mean that Link Gopher will show you the URL for, and allow you to access, the media (PDFs, JPEG, PNGs, XLS, TXT, etc.), other web pages/websites, domains/subdomains and other content on a webpage/website.
In the past, I used DownthemAll and Foxyspider with Link Gopher in Firefox, but they only work on older versions of Firefox unless you work some code; I have always found their results to be so similar that I do not usually bother getting DownthemAll or Foxyspider operational anymore, just Link Gopher as it works with both the newest versions of Firefox and slightly older instances of Firefox ESR.
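If you would rather script the same idea (or cannot use a browser extension), a rough approximation of what Link Gopher does for a single page can be put together with nothing but the Python standard library. The URL below is a placeholder, and this sketch only covers one page rather than Link Gopher’s full feature set:

```python
import re
import urllib.parse
import urllib.request
from html.parser import HTMLParser

START_URL = "https://www.targetcompany.net/"   # placeholder

class LinkCollector(HTMLParser):
    """Collect href/src attributes from a single page."""
    def __init__(self):
        super().__init__()
        self.links = set()

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                # Resolve relative links against the page we fetched
                self.links.add(urllib.parse.urljoin(START_URL, value))

req = urllib.request.Request(START_URL, headers={"User-Agent": "Mozilla/5.0"})
with urllib.request.urlopen(req) as resp:
    html = resp.read().decode(resp.headers.get_content_charset() or "utf-8",
                              errors="replace")

collector = LinkCollector()
collector.feed(html)

# Separate media/documents from everything else, and list the domains seen
media = sorted(u for u in collector.links
               if re.search(r"\.(pdf|docx?|xlsx?|pptx?|jpe?g|png|txt|xml)$",
                            urllib.parse.urlparse(u).path, re.I))
domains = sorted({urllib.parse.urlparse(u).netloc for u in collector.links})

print(f"{len(collector.links)} links total, {len(media)} look like media/documents")
print("Domains/subdomains linked from this page:", ", ".join(domains))
```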
Image Above: Searching a target company/organization website with Link Gopher leads to another website belonging to the target company/organization where almost 1400 media files existed.
Many of these were company business documents/presentations/internal documents and many instances of what amounted to employee resumes: PDFs (note names like Stephen, Peter, Lilian, etc.) that were meant to introduce an employee to a prospective client/company and establish their expertise.
These PDFs included things like extensive bios, contact information, personal anecdotes, professional experience, educational experience and interests.
The Link Gopher image above led directly to the links in the image below via a URL to another webpage.
Throughout the target company’s websites, webpages presented similar extensive background information about employees of every type throughout the company.
These links lead to full or partial biographies and career highlights teeming with OSINT, all of which include information similar to what appears in the image below, some of which is surrounded by the thick red rectangle (these webpages also included images of the employee).
Image Below: Other webpages contained thousands of documents covering all manner of the target company’s internal business matters; notice the randomized numbers, whereas trying to access “https://www.targetcompany.net/document/” directory itself only showed a 404 error message.
In other words, the full URL would have to be known and/or accessed via a fully, correctly formed link to get access to these documents.
Image Below: Accessing a website/webpage full of the target company’s media, notice the Social Media URLs with the red redactions; these represent social media accounts belonging to the target.
Link Gopher does not list the same link twice; thus, seemingly repeated links are actually different accounts, with the red redactions in these instances likely representing different usernames used by the target for their Social Media.
Web Based Resources useful in locating Archives and other sources/instances of OSINT
URLscan.io (https://urlscan.io/) is an invaluable tool for gathering OSINT; the two most basic functions it provides are:
- A scanning option that can be run against web-based resources to generate an exhaustive analysis of the target; this analysis is almost unparalleled in its thoroughness and is easy to interact with/navigate.
- The ability to search all past scans/analyses the site and its users have generated, which (if memory serves me) number into the millions.
Where gathering OSINT is concerned, this option to search against the site’s past results tends to gain a slight edge in the value department; this is due to the extra versatility this functionality allows.
Whereas generating new analysis with URLscan.io tends to be restricted to scanning URLs or domain names, the site’s huge archive of historical analysis can be searched against domains, IPs, filenames, hashes, or ASNs.
Perhaps you’ve found files belonging to the target elsewhere on the Internet; the names of these files appear to consist of random characters in no particular order, with file names much more closely resembling 3114xz7!.PDF than Apache.PDF.
So you search all past analysis generated by URLscan.io’s many users against the filenames with the seemingly random ordering/collection of characters (those more closely resembling 3114xz7!.PDF).
I have found plenty of target resources utilizing URLscan.io’s “Search” option (image below within the red square) with this and similar strategies; these resources have included many varieties of archives and BlackHat infrastructure (whether rented or exploited).
Searching in this way has even provided me with access to a target’s email server before…
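This filename hunting can also be scripted, since the “Search” option is exposed as an API. A minimal sketch follows, assuming a placeholder filename and response field names that should be verified against URLscan.io’s API documentation (the query syntax is ElasticSearch-style):

```python
import json
import urllib.parse
import urllib.request

# Placeholder: a suspiciously random-looking filename found elsewhere
FILENAME = "3114xz7.pdf"

query = f'filename:"{FILENAME}"'
url = ("https://urlscan.io/api/v1/search/?"
       + urllib.parse.urlencode({"q": query, "size": 100}))

with urllib.request.urlopen(url) as resp:
    results = json.load(resp).get("results", [])

print(f"{len(results)} past scans matched {FILENAME!r}")
for r in results:
    page = r.get("page", {})
    task = r.get("task", {})
    # Each hit points back at a stored scan you can open in the browser or
    # pull in full via the result API (see the sketch further below).
    print(task.get("time"), page.get("domain"), page.get("url"), r.get("_id"))
```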
Image Below: I utilize the “Search” option (surrounded in the red box) against a PDF file that was named with a random string of numbers, resulting in 29 different matches to past scans/analysis having been found (one of them is shown, redacted).
The first page is an extensive overview of identification data surrounding the URL in question, with the data here usually representing the entire website; the page includes any additional scans of this URL/website and an extensive listing of stored scans of websites that seem similar to this (you will often find stored scans for other websites/domains/subdomains belonging to the same target here).
Much of the OSINT I will discuss now would be especially helpful for use during triage/remediation of a security incident (perhaps you just search the URL or domain name in this instance).
Image Below: The “Submission” establishes when this scan occurred and what country of origin made the request (though obfuscation may be in play)…should information herein be helpful in discovering a source of compromise, this date/time will be helpful in establishing a timeframe of occurrences/events.
Image Below: We see more identification information for the website/domain in question under “Live Information”.
“Domain & IP Information” could be especially helpful in quantifying possible sources of an attack during incident response, as it thoroughly documents/defines every IP that URLscan.io has detected connecting to this IP or issuing a redirect.
This includes the number of redirects/connects detected and considerable information fully defining those hosts that can be viewed through multiple filters: IP Details, Subdomains, Domain Tree, Links, Certification (plus other information that is shared once one of these filters is chosen).
Remember that URLscan.io could be used to investigate any of the IPs connecting to this host, which could also allow you to investigate other hosts connecting to that host…combined with other data available here (such as the data under the HTTP tab), URLscan.io could allow investigators to defeat attempts at obfuscation and help ensure more accurate attribution.
The “Stats” section provides various connection-related data, which could be useful in more quickly deducing that there is an issue or what the issue could be (example: if this was a host with an extremely limited use case within your network, hundreds of requests in a short period could be an issue worth investigating).
“Detected Technologies” could make you aware of technology that has been added to a host without you or your team’s knowledge; if surrounding hosts have been compromised and this host now has analytics software common in traffic monetization installed, there could be a high probability this host has also been exploited.
Image Below: HTTP Transactions allows you to look at the actual code that was involved in most of the HTTP traffic (including redirects) that URLscan detected this website/domain/host participating in.
In the past, this has allowed me to locate malware that was in play but had yet to be discovered or suspected.
Since the IP/network information surrounding every host involved is so well defined, the HTTP Transactions tab has allowed me to track down redirecting malware and/or malware calling its components from multiple other sites (this tab has also helped me detect credentials that were being moved through a host en route to another host thanks to malware present).
Since these are real time captures of actual traffic that occurred, I have been able to find credentials of many sorts via the “Show response” option (tokens, usernames/passwords in transit, etc.); this tab has also helped me to identify vulnerabilities present on a host or the surrounding network.
Image Above: The “Links” tab shows all of the links present on this website, including the target’s Social Media links with what are likely usernames and post/session identifiers, both redacted in red.
The “http://resources” links that are partially redacted in black led to another website/domain/subdomain that contained media meant for internal company and customer/client communications/distribution.
There is much more that URLscan.io can be utilized for OSINT-wise, such as locating/investigating hashes that may be tied to the compromise of a host under the “IOC” tab, or the investigation of web based content under the “DOM”, “Content” and “API” tabs (these regularly provide the same depth of information as, or more than, HTTP Transactions do: malware, credentials, vulnerabilities, etc.).
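Much of what is visible in those tabs can also be pulled down in bulk via URLscan.io’s result API once you have a scan’s UUID (for instance from the search sketch earlier). This is a sketch only; the UUID is a placeholder and the “lists”/“data” field names should be verified against URLscan.io’s API documentation:

```python
import json
import urllib.request

SCAN_UUID = "00000000-0000-0000-0000-000000000000"   # placeholder UUID
url = f"https://urlscan.io/api/v1/result/{SCAN_UUID}/"

with urllib.request.urlopen(url) as resp:
    result = json.load(resp)

lists = result.get("lists", {})

# Roughly mirrors the "Links"/"IOC"/"HTTP" material discussed above
print("URLs requested by the page:", len(lists.get("urls", [])))
print("IPs contacted:", ", ".join(lists.get("ips", [])))
print("Domains involved:", ", ".join(lists.get("linkDomains", [])))
print("Resource hashes (potential IOCs):", len(lists.get("hashes", [])))

# Each entry under data.requests carries the request/response detail behind
# the "HTTP Transactions" tab (headers, redirects, response hashes, etc.)
for entry in result.get("data", {}).get("requests", [])[:5]:
    req = entry.get("request", {}).get("request", {})
    print(req.get("method"), req.get("url"))
```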
The Last Section
This release has already gone overlong and I have accepted that covering my personal/favored OSINT techniques cannot be done in two volumes; it is going to have to occur over time, over multiple releases.
So let’s end this with a few more sources in rapid order…those which have been my secret weapon for years…
I had seen Mamont’s Open FTP Index at http://www.mmnt.net// mentioned in a single Reddit post on the entirety of the Internet before I tweeted about it last December.
You can gain a ton of OSINT from open, publicly accessible FTP sites; if a target has an open FTP site with accessible content, it will be listed here (among the many other tens or hundreds of thousands listed).
Mamont’s can be searched multiple ways: file names, domain names, etc…
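Once Mamont’s points you at an open FTP server belonging to a target, its contents can be reviewed with nothing more than Python’s standard ftplib; the hostname and filename below are placeholders, and anonymous access obviously only works where the server genuinely allows it:

```python
from ftplib import FTP

HOST = "ftp.targetcompany.net"   # placeholder: an open FTP host found via Mamont's

ftp = FTP(HOST, timeout=30)
ftp.login()                      # anonymous login -- no credentials supplied

print(ftp.getwelcome())          # server banners alone can be useful OSINT
ftp.retrlines("LIST")            # directory listing of the root directory

# Pull a single interesting file down for offline review
# (filename is a placeholder for something spotted in the listing above)
with open("employee_handbook.pdf", "wb") as fh:
    ftp.retrbinary("RETR employee_handbook.pdf", fh.write)

ftp.quit()
```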
And here are all the resources you will ever need to find archives online in the form of open, publicly accessible web servers:
There is a small sub-Reddit where the absolute kings of finding publicly accessible web directories hang out; I gift you r/opendirectories at https://www.reddit.com/r/opendirectories/…
I suggest you spend time at r/opendirectories regularly (while maintaining as low a profile as possible) and learn their ways; the members there often find stuff that was meant to be so private that they joke about ensuring it never leaves the sub-Reddit (and of course, curiosity causes mass verification of these sources, which are almost always as true as they are startling); they also regularly release new sources and tools.
Guides/sources from r/opendirectories that are worth their weight in gold:
"All resources I know related to open directories"
"Googling Open Directories"
How To Find (Almost) Anything You Want On An Open Directory Page
How do I find open directories?
How are all of you finding these directories?
How do you search for open directories?
The Google Open Directory Search Engine
File Chef: Get Direct Download Links for almost anything.
Another small sub-Reddit where the users have considerable skills/resources where getting the data they want is concerned.