Summary
Research and develop a machine learning model that encodes the industry that a client firm is working in.
Business context
Industry, or sector, is one of the most prominent features of a firm. It is used to group firms for business analytics (reports, dashboards) and as an input to (machine learning) models for predictions. Industry is also among the better-known features of our clients: a so-called NAICS code (North American Industry Classification System) is assigned to close to 100% of them. However, for firms that do not bank with ING we do not have an industry classification.
Project context
We would like to have a model that classifies the industries that a firm operates in.
Who a firm’s buyers and suppliers are is largely dictated by the industry that the firm operates in. A hotel is unlikely to receive large sums from a bakery. At ING we see our clients’ buyers and suppliers, but we only know the buyers’ and suppliers’ industries if they in turn are ING clients. We thus have only partial information on these business partners, typically more complete for the smaller businesses, at least within NL.
The model should consider the industries of the buyers and suppliers, and potentially how much is paid to and received from them, and infer the industry of the firm from these. It can be trained on our own clients, for which we know the industry, and then applied to external firms to estimate their industries.
Where classical machine learning problems have a fixed set of features as input, this case (initially) does not: every firm has a different number of buyers and suppliers. Furthermore, there is no order in these; one supplier does not “go before” another. The model needs to be able to deal with this. An easy solution is to embed the buyer and supplier industries into a TF-IDF type vector, but other solutions may be out there.
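The TF-IDF idea can be sketched as follows. This is a minimal illustration, not a proposed implementation: the function name tfidf_vectors, the firm names and the toy counterparty lists are all hypothetical, and two-digit sector codes are used only to keep the example small. Each firm's unordered, variable-length list of counterparty industries is turned into a fixed-length vector, with one position per industry code seen in the data.

```python
import math
from collections import Counter


def tfidf_vectors(firms):
    """Map each firm's counterparty industry codes to a fixed-length TF-IDF vector.

    firms: dict of firm id -> list of industry codes (any length, any order).
    Returns the sorted vocabulary and a dict of firm id -> vector.
    """
    n_firms = len(firms)

    # Document frequency: for how many firms does each code occur at least once?
    doc_freq = Counter()
    for codes in firms.values():
        doc_freq.update(set(codes))
    vocab = sorted(doc_freq)

    # Smoothed inverse document frequency (one common variant, an assumption here).
    idf = {c: math.log((1 + n_firms) / (1 + doc_freq[c])) + 1.0 for c in vocab}

    vectors = {}
    for firm, codes in firms.items():
        counts = Counter(codes)
        total = sum(counts.values()) or 1  # guard against firms without counterparties
        vectors[firm] = [counts[c] / total * idf[c] for c in vocab]
    return vocab, vectors


# Hypothetical toy data: counterparty industries per firm.
example_firms = {
    "hotel_a": ["42", "72", "42"],
    "hotel_b": ["72", "42", "42"],  # same counterparties, different order
    "farm_c": ["11"],
}
vocab, vectors = tfidf_vectors(example_firms)
```

Because the vector only depends on how often each code occurs, hotel_a and hotel_b get identical vectors, and every firm gets a vector of the same length regardless of how many counterparties it has. Replacing the counts by paid or received amounts would be one way to bring in transaction volume, but that is an assumption to be tested rather than a given.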
There are degrees of being wrong. If a firm is a pig farm but the model classifies it as a sheep farm, the model is less far off than if it classifies it as an electric power generation company. Fortunately, NAICS codes contain a hierarchy that can be used to assess how far off the model is: the first two digits give the sector, the third digit the subsector, the fourth the industry group, and so on up to six digits.
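One way to turn this hierarchy into an error measure is to count how many levels the true and the predicted code share. The sketch below is an assumption about how this could be done, not an agreed metric; the function names are hypothetical. Note that a few NAICS sectors span several two-digit prefixes (31-33, 44-45, 48-49), so the first level is compared after mapping those prefixes to their sector. The example codes are believed to be hog and pig farming (112210), sheep farming (112410) and hydroelectric power generation (221111), and should be checked against the official NAICS tables.

```python
# Number of leading digits that identify each NAICS level:
# sector, subsector, industry group, NAICS industry, national industry.
LEVEL_DIGITS = (2, 3, 4, 5, 6)

# Sectors that cover more than one two-digit prefix.
MULTI_PREFIX_SECTORS = {
    "31": "31-33", "32": "31-33", "33": "31-33",
    "44": "44-45", "45": "44-45",
    "48": "48-49", "49": "48-49",
}


def _sector(code):
    """Return the sector label for a code, merging multi-prefix sectors."""
    return MULTI_PREFIX_SECTORS.get(code[:2], code[:2])


def shared_levels(true_code, predicted_code):
    """Count the hierarchy levels, from the sector down, on which two codes agree."""
    for code in (true_code, predicted_code):
        if not (code.isdigit() and 2 <= len(code) <= 6):
            raise ValueError(f"expected a 2 to 6 digit NAICS code, got {code!r}")
    if _sector(true_code) != _sector(predicted_code):
        return 0
    shared = 1
    for n in LEVEL_DIGITS[1:]:
        if len(true_code) < n or len(predicted_code) < n:
            break
        if true_code[:n] != predicted_code[:n]:
            break
        shared += 1
    return shared


def naics_distance(true_code, predicted_code):
    """0 for an exact six-digit match, 5 when even the sector differs."""
    return len(LEVEL_DIGITS) - shared_levels(true_code, predicted_code)


pig_vs_sheep = naics_distance("112210", "112410")   # same sector and subsector
pig_vs_power = naics_distance("112210", "221111")   # different sector
```

With these codes the pig farm and the sheep farm agree on sector and subsector, giving a distance of 3, while the pig farm and the power generator disagree already on the sector, giving the maximum distance of 5. Such a distance could serve as an evaluation metric next to plain accuracy, or as a basis for a hierarchy-aware loss.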
Research tasks
Research goals