lap. One model corresponds to one violated topic. One-vs-all
models can also be easily re-trained and deployed independently. We built a pipeline that runs model training as Docker [9] container-based workloads on a Kubernetes [6] cluster. We write manifest files specifying resource requirements such as CPU, GPU, and storage, which are deployed through this pipeline.
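For illustration, a training job with such resource requirements might be submitted as in the minimal sketch below, expressed with the Kubernetes Python client rather than a raw YAML manifest; the image name, namespace, topic name, and resource figures are assumptions, not the production configuration.

    from kubernetes import client, config

    config.load_kube_config()  # or load_incluster_config() when running inside the cluster

    # Resource requirements analogous to those written in the manifest files.
    resources = client.V1ResourceRequirements(
        requests={"cpu": "4", "memory": "16Gi", "ephemeral-storage": "50Gi"},
        limits={"cpu": "8", "memory": "32Gi", "nvidia.com/gpu": "1"},
    )

    container = client.V1Container(
        name="train-violated-topic-a",                          # hypothetical topic
        image="registry.example.com/moderation/train:latest",   # illustrative image
        resources=resources,
    )

    job = client.V1Job(
        api_version="batch/v1",
        kind="Job",
        metadata=client.V1ObjectMeta(name="train-violated-topic-a"),
        spec=client.V1JobSpec(
            template=client.V1PodTemplateSpec(
                spec=client.V1PodSpec(containers=[container], restart_policy="Never")
            ),
            backoff_limit=2,
        ),
    )

    client.BatchV1Api().create_namespaced_job(namespace="moderation", body=job)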
For ML algorithms, we used Gated Multimodal Units (GMU) [5] and Gradient Boosted Decision Trees (GBDT) [7]. GMU potentially provides the highest accuracy by exploiting multimodal data, while GBDT is efficient to train and to serve when the training dataset is not large. We train the GMU models using PyTorch [10] and deploy them by converting from PyTorch to ONNX [4] and then to Caffe2 [8].
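As a rough illustration of this path, the sketch below defines a toy gated fusion module in the spirit of GMU and exports it with torch.onnx.export; the architecture, feature dimensions, and input names are illustrative assumptions rather than the production model.

    import torch
    import torch.nn as nn

    class TinyGMU(nn.Module):
        """Toy gated multimodal fusion of image and text features (illustrative only)."""
        def __init__(self, img_dim=2048, txt_dim=300, hidden=256):
            super().__init__()
            self.img_proj = nn.Linear(img_dim, hidden)
            self.txt_proj = nn.Linear(txt_dim, hidden)
            self.gate = nn.Linear(img_dim + txt_dim, hidden)
            self.out = nn.Linear(hidden, 1)

        def forward(self, img, txt):
            # Gate decides how much to trust the image vs. the text modality.
            z = torch.sigmoid(self.gate(torch.cat([img, txt], dim=1)))
            h = z * torch.tanh(self.img_proj(img)) + (1 - z) * torch.tanh(self.txt_proj(txt))
            return torch.sigmoid(self.out(h))

    model = TinyGMU().eval()
    dummy_img, dummy_txt = torch.randn(1, 2048), torch.randn(1, 300)

    # Export to ONNX; the resulting graph can then be converted to Caffe2 for serving.
    torch.onnx.export(model, (dummy_img, dummy_txt), "gmu.onnx",
                      input_names=["image", "text"], output_names=["violation_score"],
                      opset_version=11)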
The system then automatically evaluates the new model
against the current model in production using offline evalua-
tion (Sec. 2.2).
2.2 Evaluation
We propose offline and online evaluation to avoid concept
drift [11].
Offline evaluation.
We use the precision@
K
of the model
as our evaluation metric since it directly contributes to the
moderator’s productivity, where
K
is the bound on the number
of alerts in each violated topic and is defined by the moder-
ation team. We evaluate the new model based on back-tests
(the current model’s output on test data is known). The back-
tests guarantee that the new model is not worse than the cur-
rent model which prevents concept drift. However, this test
is biased towards the current model because the labels were
created based on it. Thus, we also evaluate online for a prede-
termined number of days by using both models for predictions
on all items.
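For concreteness, precision@K here is simply the fraction of confirmed violations among the K highest-scored items; the back-test comparison can then be expressed as in the sketch below, where the variable names are placeholders rather than the actual pipeline code.

    import numpy as np

    def precision_at_k(scores, labels, k):
        """Fraction of confirmed violations among the k highest-scored items."""
        top_k = np.argsort(scores)[::-1][:k]
        return float(np.mean(np.asarray(labels)[top_k]))

    # Back-test sketch: release only if the new model is at least as good as the
    # current one on the same labeled test set (K, y, and the scores are placeholders).
    # release_ok = precision_at_k(new_scores, y, K) >= precision_at_k(cur_scores, y, K)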
Online evaluation. In our scenario, A/B testing is slow for decision making because the number of violations is much lower than the number of valid item listings; an A/B test can therefore take several months, during which concept drift can occur, and this does not meet our business requirements. For faster decision making, we deploy the current and new models in production, and both of them accept all traffic. We set the thresholds of the current and new models so that each alerts half the target number of items in that violated topic, i.e., each model sends K/2 alerts to the moderators. If the new model achieves better precision@K/2 during online evaluation, we deprecate the old one and expand the new model to the target number of alerts.
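A minimal sketch of this alert split, with randomly generated stand-in scores; the budget K, item ids, and score values are illustrative only.

    import random

    def top_k_items(scored_items, k):
        """scored_items: iterable of (item_id, score); return the k highest-scored ids."""
        return [item_id for item_id, _ in
                sorted(scored_items, key=lambda pair: pair[1], reverse=True)[:k]]

    K = 10  # per-topic alert budget defined by the moderation team (placeholder)
    items = [f"item-{i}" for i in range(1000)]
    current_scores = [(i, random.random()) for i in items]  # stand-in model scores
    new_scores = [(i, random.random()) for i in items]

    # Each model alerts its own top K/2; moderator verdicts on the two halves
    # later yield precision@K/2 for the current and the new model respectively.
    alerts = {"current": top_k_items(current_scores, K // 2),
              "new": top_k_items(new_scores, K // 2)}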
Table 1 shows the relative performance gains of the models based on precision@K. It shows that our back-test reflects the performance in production.
3 System Design
Figure 2 describes the system architecture.
Table 1: Percentage gains of GBDT and GMU compared to Logistic Regression in offline and online evaluation on one violated topic.

Algorithm   Offline   Online
GBDT        +18.2%    Not released
GMU         +21.2%    +23.2%
Figure 2: Auto content moderation system architecture. GBDT: a single container performs both preprocessing (scikit-learn) and inference. GMU: two containers, i) preprocessing and ii) inference with the Caffe2 model.
Our deployments are managed using the Horizontal Pod Autoscaler, which helps to maintain high availability and cut down production costs. The system has a proxy layer which gets messages from a queue and makes REST calls to the prediction layer. The prediction layer is responsible for preprocessing and inference, and returns a prediction result to the proxy layer. The proxy layer aggregates the responses from all models and, for each item predicted as positive by at least one model, publishes a message to a different queue; these messages are then picked up by a worker and sent to the moderators for manual review. During online and offline evaluation, the proxy layer logs the predictions from all models, and these logs are exported to a data lake.
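A minimal sketch of the proxy layer's fan-out and aggregation, assuming hypothetical per-topic REST endpoints and queue/log callbacks; none of these names come from the production system.

    import json
    import requests

    # Hypothetical per-topic prediction-layer endpoints (illustrative URLs).
    MODEL_ENDPOINTS = {
        "violated_topic_a": "http://gbdt-topic-a:8080/predict",
        "violated_topic_n": "http://gmu-topic-n:8080/predict",
    }

    def handle_message(item, publish, log):
        """Fan one listing out to every model, log all predictions, publish positives."""
        predictions = {}
        for topic, url in MODEL_ENDPOINTS.items():
            resp = requests.post(url, json=item, timeout=5)
            predictions[topic] = resp.json()  # e.g. {"positive": true, "score": 0.93}

        log(item["id"], predictions)  # exported to the data lake for evaluation

        positive_topics = [t for t, p in predictions.items() if p["positive"]]
        if positive_topics:
            # Publish to the queue consumed by the worker that notifies moderators.
            publish(json.dumps({"item_id": item["id"], "topics": positive_topics}))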
4 Conclusion
Content moderation in C2C e-Commerce is a very challenging
and interesting problem. It is also an essential part of services
providing content to customers. In this paper, we discussed
some of the challenges, such as introducing new ML models into production and efficiently preventing concept drift, based on our experience. Our Auto Content Moderation system successfully increased moderation coverage by 554.8% over a rule-based approach.
Acknowledgments
The authors would like to express their gratitude to Abhishek
Vilas Munagekar and Yusuke Shido for their contribution to
this system and Dr. Antony Lam for his valuable feedback
about the paper.